---
layout: post
title: "Stochastic Gradient Descent in Continuous Time"
date: 2023-09-28
---

Stochastic Gradient Descent in Continuous Time

Justin Sirignano and Konstantinos Spiliopoulos

October 31, 2017

Abstract

Stochastic gradient descent in continuous time (SGDCT) provides a computationally efficient method for the statistical learning of continuous-time models, which are widely used in science, engineering, and finance. The SGDCT algorithm follows a (noisy) descent direction along a continuous stream of data. SGDCT performs an online parameter update in continuous time, with the parameter updates $\theta_{t}$ satisfying a stochastic differential equation. We prove that $\lim_{t \rightarrow \infty} \nabla \bar{g}\left(\theta_{t}\right)=0$, where $\bar{g}$ is a natural objective function for the estimation of the continuous-time dynamics. The convergence proof leverages ergodicity by using an appropriate Poisson equation to help describe the evolution of the parameters for large times. For certain continuous-time problems, SGDCT has some promising advantages compared to a traditional stochastic gradient descent algorithm. This paper mainly focuses on applications in finance, such as model estimation for stocks, bonds, interest rates, and financial derivatives. SGDCT can also be used for the optimization of high-dimensional continuous-time models, such as American options. As an example application, SGDCT is combined with a deep neural network to price high-dimensional American options (up to 100 dimensions).

1. Introduction

This paper develops a statistical learning algorithm for continuous-time models, which are common in science, engineering, and finance. We study its theoretical convergence properties as well as its computational performance in a number of benchmark problems. Although the method is broadly applicable, this paper mainly focuses on applications in finance. Given a continuous stream of data, stochastic gradient descent in continuous time (SGDCT) can estimate unknown parameters or functions in stochastic differential equation (SDE) models for stocks, bonds, interest rates, and financial derivatives. The statistical learning algorithm can also be used for the optimization of high-dimensional continuous-time models, such as American options. High-dimensional American options have been a longstanding computational challenge in finance. SGDCT is able to accurately solve American options even in 100 dimensions.
Batch optimization for the statistical estimation of continuous-time models can be impractical for large datasets where observations occur over a long period of time. Each batch descent step requires evaluating the model error over the entire observed data path, so batch optimization is slow (sometimes impractically slow) for long time horizons or for models which are computationally costly to evaluate (e.g., partial differential equations). Typical existing approaches in the financial statistics literature use batch optimization.
SGDCT provides a computationally efficient method for statistical learning over long time periods and for complex models. SGDCT continuously follows a (noisy) descent direction along the path of the observation; this results in much more rapid convergence. Parameters are updated online in continuous time, with the parameter updates $\theta_{t}$ satisfying a stochastic differential equation. We prove that $\lim_{t \rightarrow \infty} \nabla \bar{g}\left(\theta_{t}\right)=0$, where $\bar{g}$ is a natural objective function for the estimation of the continuous-time dynamics.
Consider a diffusion $X_{t} \in \mathcal{X}=\mathbb{R}^{m}$:
$$d X_{t}=f^{*}\left(X_{t}\right) d t+\sigma d W_{t}. \tag{1.1}$$
The goal is to statistically estimate a model $f(x, \theta)$ for $f^{*}(x)$, where $\theta \in \mathbb{R}^{n}$. The function $f^{*}(x)$ is unknown. $W_{t} \in \mathbb{R}^{m}$ is a standard Brownian motion. The diffusion term $\sigma d W_{t}$ represents any random behavior of the system or environment. The functions $f(x, \theta)$ and $f^{*}(x)$ may be non-convex.
The stochastic gradient descent update in continuous time follows the SDE:
$$d \theta_{t}=\alpha_{t}\left[\nabla_{\theta} f\left(X_{t}, \theta_{t}\right)\left(\sigma \sigma^{\top}\right)^{-1} d X_{t}-\nabla_{\theta} f\left(X_{t}, \theta_{t}\right)\left(\sigma \sigma^{\top}\right)^{-1} f\left(X_{t}, \theta_{t}\right) d t\right], \tag{1.2}$$
where $\nabla_{\theta} f\left(X_{t}, \theta_{t}\right)$ is matrix-valued and $\alpha_{t}$ is the learning rate. The parameter update (1.2) can be used both for statistical estimation given previously observed data and for online learning (i.e., statistical estimation in real time as data becomes available). SGDCT will still converge if $\sigma \sigma^{\top}$ in (1.2) is replaced by the identity matrix $I$.
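To make the update concrete, here is a minimal Euler-Maruyama sketch of a single step of (1.2) in the scalar case. The names (`sgdct_step`, `f`, `grad_f`) and the linear model in the example are illustrative, not from the paper; the observed increment `dx` plays the role of $dX_t$, the noisy estimate of $f^{*}(X_t)\,dt$:

```python
def sgdct_step(theta, x, dx, dt, alpha, f, grad_f, sigma2):
    """One Euler-Maruyama step of the SGDCT update (1.2), scalar case.

    dx is the observed increment of X over [t, t + dt] (the noisy
    estimate of f*(x) dt), and sigma2 plays the role of sigma sigma^T.
    """
    g = grad_f(x, theta)  # gradient of the model drift in theta
    return theta + alpha * (g * dx - g * f(x, theta) * dt) / sigma2


# Example with the hypothetical linear model f(x, theta) = theta * x:
new_theta = sgdct_step(theta=1.0, x=2.0, dx=0.5, dt=0.1, alpha=0.01,
                       f=lambda x, th: th * x,
                       grad_f=lambda x, th: x,
                       sigma2=1.0)
```

As noted above, setting `sigma2 = 1.0` (the identity in the multidimensional case) does not affect convergence, only the preconditioning of the descent direction.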
Using the proposed approach of this paper, the stochastic gradient descent algorithm (1.2) can also be generalized to the case where $\sigma$ is a variable coefficient $\sigma^{*}\left(X_{t}\right)$. In that case, a model $\sigma(x, \nu)$ is also learned for $\sigma^{*}(x)$, where $\nu \in \mathbb{R}^{k}$ is an additional set of parameters. (To be more precise, $\sigma(x, \nu) \sigma^{\top}(x, \nu)$ is learned for $\sigma^{*}(x) \sigma^{*, \top}(x)$ since $\sigma^{*}(x)$ is not identifiable.) See Section 4 for details and the corresponding convergence proof.
We assume that $X_{t}$ is sufficiently ergodic (to be concretely specified later in the paper) and that it has some well-behaved $\pi(d x)$ as its unique invariant measure. As a general notation, if $h(x, \theta)$ is a generic $L^{1}(\pi)$ function, then we define its average over $\pi(d x)$ to be
$$\bar{h}(\theta)=\int_{\mathcal{X}} h(x, \theta) \pi(d x).$$
Let us set
$$g(x, \theta)=\frac{1}{2}\left\|f(x, \theta)-f^{*}(x)\right\|_{\sigma \sigma^{\top}}^{2}=\frac{1}{2}\left\langle f(x, \theta)-f^{*}(x),\left(\sigma \sigma^{\top}\right)^{-1}\left(f(x, \theta)-f^{*}(x)\right)\right\rangle.$$
The gradient $\nabla_{\theta} g\left(X_{t}, \theta\right)$ cannot be evaluated since $f^{*}(x)$ is unknown. However, $d X_{t}=f^{*}\left(X_{t}\right) d t+\sigma d W_{t}$ is a noisy estimate of $f^{*}\left(X_{t}\right) d t$, which leads to the algorithm (1.2). SGDCT follows a noisy descent direction along a continuous stream of data produced by $X_{t}$.
Heuristically, it is expected that $\theta_{t}$ will tend towards the minimum of the function $\bar{g}(\theta)=\int_{\mathcal{X}} g(x, \theta) \pi(d x)$. The data $X_{t}$ will be correlated over time, which complicates the mathematical analysis. This differs from the standard discrete-time version of stochastic gradient descent, where the data is usually assumed to be i.i.d. at every step.
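This heuristic can be checked on a toy example. The following self-contained simulation (not from the paper; the model $f(x,\theta)=-\theta x$ with true $\theta^{*}=2$, $\sigma=1$, and the learning rate $\alpha_t = 10/(10+t)$, which satisfies Condition 2.1 below, are all assumed choices) runs SGDCT along a single simulated Ornstein-Uhlenbeck path and watches $\theta_t$ drift toward $\theta^{*}$:

```python
import math
import random

random.seed(0)
theta_true, sigma, dt, T = 2.0, 1.0, 0.01, 2000.0
x, theta = 0.0, 0.0      # state of (1.1) and the SGDCT parameter estimate

for k in range(int(T / dt)):
    alpha = 10.0 / (10.0 + k * dt)             # learning rate alpha_t
    dW = random.gauss(0.0, math.sqrt(dt))
    dx = -theta_true * x * dt + sigma * dW     # observed increment of (1.1)
    grad_f = -x                                # d/dtheta of f(x, theta) = -theta * x
    theta += alpha * grad_f * (dx - (-theta * x) * dt) / sigma ** 2  # update (1.2)
    x += dx

print(theta)  # drifts toward theta_true = 2.0
```

Note that each step only touches the current observation, in contrast to a batch method that would sweep the whole path per descent step.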

1.1 Literature Review

In this paper we show that if $\alpha_{t}$ is appropriately chosen then $\nabla \bar{g}\left(\theta_{t}\right) \rightarrow 0$ as $t \rightarrow \infty$ with probability 1 (see Theorem 2.4). Results like this have been previously derived for stochastic gradient descent in discrete time; see [9] and [8]. [9] proves convergence in the absence of the $X$ term. [8] proves convergence of stochastic gradient descent in discrete time with the $X$ process but requires stronger conditions than [9].
Although stochastic gradient descent for discrete time has been extensively studied, stochastic gradient descent in continuous time has received relatively little attention. We refer readers to [8, 16] and [9] for a thorough review of the very large literature on stochastic gradient descent. There are also many algorithms which modify traditional stochastic gradient descent (stochastic gradient descent with momentum, Adagrad, RMSprop, etc.). For a review of these variants of stochastic gradient descent, see [15]. We mention below the prior work which is most relevant to our paper.
Our approach and assumptions required for convergence are most similar to [9], who prove convergence of discrete-time stochastic gradient descent in the absence of the $X$ process. The presence of the $X$ process is essential for considering a wide range of problems in continuous time, and showing convergence in its presence is considerably more difficult. The $X$ term introduces correlation across times, and this correlation does not disappear as time tends to infinity. This makes it challenging to prove convergence in the continuous-time case. In order to prove convergence, we use an appropriate Poisson equation associated with $X$ to describe the evolution of the parameters for large times.
[21] proves, in a setting different than ours, convergence in $L^{2}$ of projected stochastic gradient descent in discrete time for convex functions. In projected gradient descent, the parameters are projected back into an a priori chosen compact set. Therefore, the algorithm cannot hope to reach the minimum if the minimum is located outside of the chosen compact set. Of course, the compact set can be chosen to be very large for practical purposes. Our paper considers unconstrained stochastic gradient descent in continuous time and proves the almost sure convergence $\nabla \bar{g}\left(\theta_{t}\right) \rightarrow 0$ as $t \rightarrow \infty$, taking into account the $X$ component as well. We do not assume any stability conditions on $X$ (except that it is ergodic with a unique invariant measure).
Another approach for proving convergence of discrete-time stochastic gradient descent is to show that the algorithm converges to the solution of an ODE which itself converges to a limiting point. This is the approach of [8]; see also [16]. This method, sometimes called the "ODE method", requires the assumption that the iterates (i.e., the model parameters which are being learned) remain in a bounded set with probability one. It is unclear whether the ODE method of proof can be successfully used to show convergence for a continuous-time stochastic gradient descent scheme. In this paper we follow a potentially more straightforward method of proof by analyzing the speed of convergence to equilibrium with an appropriately chosen Poisson type of equation.
[25] studies continuous-time stochastic mirror descent in a setting different than ours. In the framework of [25], the objective function is known. In this paper, we consider the statistical estimation of the unknown dynamics of a random process (i.e., the $X$ process satisfying (1.1)).
Statisticians and financial engineers have actively studied parameter estimation of SDEs, although typically not with statistical learning or machine learning approaches. The likelihood function will usually be calculated from the entire observed path of X X XXX (i.e., batch optimization) and then maximized to find the maximum likelihood estimator (MLE). Unlike in this paper, the actual optimization procedure to maximize the likelihood function is often not analyzed.
Some relevant publications in the financial statistics literature include [1, 2], [7], and [14]. [7] derives the likelihood function for continuously observed $X$; the MLE can then be calculated via batch optimization. [1] and [2] consider the case where $X$ is discretely observed and calculate MLEs via a batch optimization approach. [14] estimates parameters by a Bayesian approach. Readers are referred to [10, 17, 26] for thorough reviews of classical statistical inference methods for stochastic differential equations.

1.2 Applications of SGDCT

Continuous-time models are especially common in finance. Given a continuous stream of data, the stochastic gradient descent algorithm can be used to estimate unknown parameters or functions in SDE models for stocks, bonds, interest rates, and financial derivatives. Numerical analysis of SGDCT for two common financial models is included in Sections 5.1, 5.2, and 5.5. The first is the well-known Ornstein-Uhlenbeck (OU) process (for examples in finance, see [19], [20], [29], and [18]). The second is the multidimensional CIR process, which is a common model for interest rates (for examples in finance, see [4], [22], [11], [5], and [12]).
Scientific and engineering models are also typically in continuous time. There are often coefficients or functions in these models which are uncertain or unknown; stochastic gradient descent can be used to learn these model parameters from data. In Section 5, we study the numerical performance for two example applications: Burgers' equation and the classic reinforcement learning problem of balancing a pole on a moving cart. Burgers' equation is a widely used nonlinear partial differential equation which is important to fluid mechanics, acoustics, and aerodynamics.
A natural question is why use SGDCT versus a straightforward approach which (1) discretizes the continuous-time dynamics and then (2) applies traditional stochastic gradient descent. For some of the same reasons that scientific models have been largely developed in continuous time, it can be advantageous to develop continuous-time statistical learning for continuous-time models.
SGDCT allows for the application of numerical schemes of choice to the theoretically correct statistical learning equation for continuous-time models. This can lead to more accurate and more computationally efficient parameter updates. Numerical schemes are always applied to continuous-time dynamics, and different schemes may have different properties for different continuous-time models. Discretizing the system dynamics a priori and then applying a traditional discrete-time stochastic gradient descent scheme can result in a loss of accuracy. For example, there is no guarantee that (1) using a higher-order accurate scheme to discretize the system dynamics and then (2) applying traditional stochastic gradient descent will produce a statistical learning scheme which is higher-order accurate in time. Hence, it makes sense to first develop the continuous-time statistical learning equation, and then apply the higher-order accurate numerical scheme.
Besides model estimation, SGDCT can be used to solve continuous-time optimization problems, such as American options. We combine SGDCT with a deep neural network to solve American options in up to 100 dimensions (see Section 6). An alternative approach would be to discretize the dynamics and then use the Q-learning algorithm (traditional stochastic gradient descent applied to an approximation of the discrete HJB equation). However, Q-learning is biased while SGDCT is unbiased. Furthermore, in SDE models with Brownian motions, the Q-learning algorithm can blow up as the time step size $\Delta$ becomes small; see Section 6 for details.
The convergence issue with Q-learning highlights the importance of studying continuous-time algorithms for continuous-time models. It is of interest to show that (1) a discrete-time scheme converges to an appropriate continuous-time scheme as $\Delta \rightarrow 0$ and (2) the continuous-time scheme converges to the correct estimate as $t \rightarrow \infty$. These are important questions since any discrete scheme for a continuous-time model incurs some error proportional to $\Delta$, and therefore $\Delta$ must be decreased to reduce error. It is also important to note that in some cases, such as Q-learning, computationally expensive terms in the discrete algorithm (such as expectations over high-dimensional spaces) may become much simpler expressions in the continuous-time scheme (differential operators).

1.3 Organization of Paper

The paper is organized into five main sections. Section 2 presents the assumptions and the main theorem. In Section 3 we prove the main result of this paper for the convergence of continuous-time stochastic gradient descent. The extension of the stochastic gradient descent algorithm to the case of a variable diffusion coefficient function is described in Section 4. Section 5 provides numerical analysis of SGDCT for model estimation in several applications. Section 6 discusses SGDCT for solving continuous-time optimization problems, particularly focusing on American options.

2. Assumptions and Main Result

Before presenting the main result of this paper, Theorem 2.4, let us elaborate on the standing assumptions. Regarding the learning rate $\alpha_{t}$, the standing assumption is
Condition 2.1. Assume that $\int_{0}^{\infty} \alpha_{t} d t=\infty$, $\int_{0}^{\infty} \alpha_{t}^{2} d t<\infty$, $\int_{0}^{\infty}\left|\alpha_{s}^{\prime}\right| d s<\infty$, and that there is a $p>0$ such that $\lim_{t \rightarrow \infty} \alpha_{t}^{2} t^{1/2+2p}=0$.
A standard choice for $\alpha_{t}$ that satisfies Condition 2.1 is $\alpha_{t}=\frac{1}{C+t}$ for some constant $0<C<\infty$. Notice that the condition $\int_{0}^{\infty}\left|\alpha_{s}^{\prime}\right| d s<\infty$ follows immediately from the other two restrictions on the learning rate if it is chosen to be a monotonic function of $t$.
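For $\alpha_{t}=\frac{1}{C+t}$ the requirements of Condition 2.1 can be checked in closed form: $\int_{0}^{T} \alpha_{t}\,dt = \log\frac{C+T}{C}$ diverges, $\int_{0}^{T} \alpha_{t}^{2}\,dt = \frac{1}{C}-\frac{1}{C+T}$ stays bounded, and $\alpha_{t}^{2} t^{1/2+2p} \rightarrow 0$ for any $p<3/4$. A short numerical sketch of these three facts (the choices $C=1$ and $p=0.1$ are arbitrary illustrations):

```python
import math

C, p = 1.0, 0.1          # arbitrary illustrative constants
alpha = lambda t: 1.0 / (C + t)

for T in (1e2, 1e4, 1e6):
    int_alpha = math.log((C + T) / C)            # diverges as T grows (first condition)
    int_alpha2 = 1.0 / C - 1.0 / (C + T)         # bounded by 1/C (second condition)
    decay = alpha(T) ** 2 * T ** (0.5 + 2 * p)   # -> 0 since 1/2 + 2p < 2
    print(T, int_alpha, int_alpha2, decay)
```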
Let us next discuss the assumptions that we impose on $\sigma$, $f^{*}(x)$, and $f(x, \theta)$. Condition 2.2 guarantees uniqueness and existence of an invariant measure for the $X$ process.
Condition 2.2. We assume that $\sigma \sigma^{\top}$ is a non-degenerate bounded diffusion matrix and that $\lim_{|x| \rightarrow \infty} f^{*}(x) \cdot x=-\infty$.
In addition, with respect to $\nabla_{\theta} f(x, \theta)$ we assume that $\theta \in \mathbb{R}^{n}$ and we impose the following condition.
Condition 2.3. 1. We assume that $\nabla_{\theta} g(x, \cdot) \in C^{2}\left(\mathbb{R}^{n}\right)$ for all $x \in \mathcal{X}$, $\frac{\partial^{2} \nabla_{\theta} g}{\partial x^{2}} \in C\left(\mathcal{X}, \mathbb{R}^{n}\right)$, $\nabla_{\theta} g(\cdot, \theta) \in C^{\alpha}(\mathcal{X})$ uniformly in $\theta \in \mathbb{R}^{n}$ for some $\alpha \in(0,1)$, and that there exist $K$ and $q$ such that
$$\sum_{i=0}^{2}\left|\frac{\partial^{i} \nabla_{\theta} g}{\partial \theta^{i}}(x, \theta)\right| \leq K\left(1+|x|^{q}\right).$$
2. For every $N>0$ there exists a constant $C(N)$ such that for all $\theta_{1}, \theta_{2} \in \mathbb{R}^{n}$ and $|x| \leq N$, the diffusion coefficient $\nabla_{\theta} f$ satisfies
$$\left|\nabla_{\theta} f\left(x, \theta_{1}\right)-\nabla_{\theta} f\left(x, \theta_{2}\right)\right| \leq C(N)\left|\theta_{1}-\theta_{2}\right|.$$
Moreover, there exist $K>0$ and $q>0$ such that
$$\left|\nabla_{\theta} f(x, \theta)\right| \leq K\left(1+|x|^{q}\right).$$
3. The function $f^{*}(x)$ is $C^{2+\alpha}(\mathcal{X})$ with $\alpha \in(0,1)$. Namely, it has two derivatives in $x$, with all partial derivatives being Hölder continuous, with exponent $\alpha$, with respect to $x$.
Condition 2.3 allows one to control the ergodic behavior of the $X$ process. As will be seen from the proof of the main convergence result, Theorem 2.4, one needs to control terms of the form $\int_{0}^{t} \alpha_{s}\left(\nabla \bar{g}\left(\theta_{s}\right)-\nabla g\left(X_{s}, \theta_{s}\right)\right) d s$. Due to ergodicity of the $X$ process one expects that such terms are small in magnitude and go to zero as $t \rightarrow \infty$. However, the speed at which they go to zero is what matters here. We treat such terms by rewriting them equivalently using appropriate Poisson type partial differential equations (PDEs). Condition 2.3 guarantees that these Poisson equations have unique solutions that do not grow faster than polynomially in the $x$ variable (see Theorem A.1 in Appendix A).
The main result of this paper is Theorem 2.4.
Theorem 2.4. Assume that Conditions 2.1, 2.2 and 2.3 hold. Then we have that
$$\lim _{t \rightarrow \infty}\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\|=0, \text{ almost surely.}$$

3. Proof of Theorem 2.4

We proceed in a spirit similar to that of [9]. However, apart from continuous versus discrete dynamics, one of the main challenges of the proof here is the presence of the ergodic $X$ process. Let us consider an arbitrarily given $\kappa>0$ and $\lambda=\lambda(\kappa)>0$ to be chosen. Then set $\sigma_{0}=0$ and consider the cycles of random times
$$0=\sigma_{0} \leq \tau_{1} \leq \sigma_{1} \leq \tau_{2} \leq \sigma_{2} \leq \ldots$$
where for $k=1,2,\ldots$
$$\begin{aligned} \tau_{k} & =\inf \left\{t>\sigma_{k-1}:\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\| \geq \kappa\right\}, \\ \sigma_{k} & =\sup \left\{t>\tau_{k}: \frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|}{2} \leq\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\| \leq 2\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \text{ for all } s \in\left[\tau_{k}, t\right] \text{ and } \int_{\tau_{k}}^{t} \alpha_{s} d s \leq \lambda\right\}. \end{aligned}$$
The purpose of these random times is to control the periods of time where $\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\|$ is close to zero and away from zero. Let us next define the random time intervals $J_{k}=\left[\sigma_{k-1}, \tau_{k}\right)$ and $I_{k}=\left[\tau_{k}, \sigma_{k}\right)$. Notice that for every $t \in J_{k}$ we have $\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\|<\kappa$.
Let us next consider some $\eta>0$ sufficiently small to be chosen later on and set $\sigma_{k, \eta}=\sigma_{k}+\eta$. Lemma 3.1 is crucial for the proof of Theorem 2.4.
Lemma 3.1. Assume that Conditions 2.1, 2.2 and 2.3 hold. Let us set
$$\Gamma_{k, \eta}=\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left(\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)-\nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right) d s.$$
Then, with probability one we have that
$$\left\|\Gamma_{k, \eta}\right\| \rightarrow 0, \text{ as } k \rightarrow \infty.$$
Proof. The idea is to use Theorem A.1 in order to get an equivalent expression for the term $\Gamma_{k, \eta}$ that we seek to control.
Let us consider the function $G(x, \theta)=\nabla_{\theta} g(x, \theta)-\nabla_{\theta} \bar{g}(\theta)$. Notice that by definition and due to Condition 2.3, the function $G(x, \theta)$ satisfies the centering condition (A.1) of Theorem A.1 componentwise. So, the Poisson equation (A.2) will have a unique smooth solution, denoted by $v(x, \theta)$, that grows at most polynomially in $x$. Let us apply Itô's formula to the vector-valued function $u(t, x, \theta)=\alpha_{t} v(x, \theta)$. Doing so, we get for $i=1, \ldots, n$
$$\begin{aligned} u_{i}\left(\sigma, X_{\sigma}, \theta_{\sigma}\right)-u_{i}\left(\tau, X_{\tau}, \theta_{\tau}\right) & =\int_{\tau}^{\sigma} \partial_{s} u_{i}\left(s, X_{s}, \theta_{s}\right) d s+\int_{\tau}^{\sigma} \mathcal{L}_{x} u_{i}\left(s, X_{s}, \theta_{s}\right) d s+\int_{\tau}^{\sigma} \mathcal{L}_{\theta} u_{i}\left(s, X_{s}, \theta_{s}\right) d s \\ & \quad+\int_{\tau}^{\sigma} \alpha_{s} \operatorname{tr}\left[\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \nabla_{x} \nabla_{\theta} u_{i}\left(s, X_{s}, \theta_{s}\right)\right] d s \\ & \quad+\int_{\tau}^{\sigma}\left\langle\nabla_{x} u_{i}\left(s, X_{s}, \theta_{s}\right), \sigma d W_{s}\right\rangle+\int_{\tau}^{\sigma} \alpha_{s}\left\langle\nabla_{\theta} u_{i}\left(s, X_{s}, \theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle, \end{aligned}$$
where $\mathcal{L}_{x}$ and $\mathcal{L}_{\theta}$ denote the infinitesimal generators for the processes $X$ and $\theta$, respectively.
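Appendix A is not reproduced in this excerpt, but the role of $v$ can be read off from how it is used: componentwise, $v$ solves a Poisson equation whose form can be reconstructed from the identity $\Gamma_{k, \eta}=\int \mathcal{L}_{x} u\, ds$ in the display below (a reconstruction from context, with the sign convention matching that algebra):

```latex
% Reconstructed form of the centering condition (A.1) and Poisson equation (A.2),
% inferred from the way v(x, theta) is used in this proof:
\int_{\mathcal{X}} G(x,\theta)\,\pi(dx) = 0
\qquad \text{(centering, (A.1))},
\qquad
\mathcal{L}_x v(x,\theta) = G(x,\theta)
\qquad \text{((A.2))},
```

so that $\alpha_{s} G\left(X_{s}, \theta_{s}\right)=\mathcal{L}_{x} u\left(s, X_{s}, \theta_{s}\right)$, with $v$ and its first derivatives growing at most polynomially in $x$ by Condition 2.3.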
Recall now that $v(x, \theta)$ is the solution to the given Poisson equation and that $u(s, x, \theta)=\alpha_{s} v(x, \theta)$. Using these facts and rearranging the previous Itô formula, we get, in vector notation,
$$
\begin{aligned}
\Gamma_{k, \eta} &= \int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left(\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)-\nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right) ds = \int_{\tau_{k}}^{\sigma_{k, \eta}} \mathcal{L}_{x} u\left(s, X_{s}, \theta_{s}\right) ds \\
&= \left[\alpha_{\sigma_{k, \eta}} v\left(X_{\sigma_{k, \eta}}, \theta_{\sigma_{k, \eta}}\right)-\alpha_{\tau_{k}} v\left(X_{\tau_{k}}, \theta_{\tau_{k}}\right)-\int_{\tau_{k}}^{\sigma_{k, \eta}} \partial_{s} \alpha_{s}\, v\left(X_{s}, \theta_{s}\right) ds\right] \\
&\quad -\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left[\mathcal{L}_{\theta} v\left(X_{s}, \theta_{s}\right)+\alpha_{s} \operatorname{tr}\left[\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \nabla_{x_{i}} \nabla_{\theta} v\left(X_{s}, \theta_{s}\right)\right]_{i=1}^{m}\right] ds \\
&\quad -\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left\langle\nabla_{x} v\left(X_{s}, \theta_{s}\right), \sigma\, dW_{s}\right\rangle-\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}^{2}\left\langle\nabla_{\theta} v\left(X_{s}, \theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} dW_{s}\right\rangle .
\end{aligned}
$$
The next step is to treat each term on the right hand side of (3.1) separately. For this purpose, let us first set
$$
J_{t}^{(1)}=\alpha_{t} \sup _{s \in[0, t]}\left\|v\left(X_{s}, \theta_{s}\right)\right\| .
$$
By Theorem A.1 and Proposition 2 of [23], there are some $0<K<\infty$ (that may change from line to line below) and $0<q<\infty$ such that for $t$ large enough
$$
\begin{aligned}
\mathbb{E}\left|J_{t}^{(1)}\right|^{2} & \leq K \alpha_{t}^{2} \mathbb{E}\left[1+\sup _{s \in[0, t]}\left\|X_{s}\right\|^{q}\right]=K \alpha_{t}^{2}\left[1+\sqrt{t}\, \frac{\mathbb{E} \sup _{s \in[0, t]}\left\|X_{s}\right\|^{q}}{\sqrt{t}}\right] \\
& \leq K \alpha_{t}^{2}[1+\sqrt{t}] \leq K \alpha_{t}^{2} \sqrt{t} .
\end{aligned}
$$
By Condition 2.1, let us consider $p>0$ such that $\lim _{t \rightarrow \infty} \alpha_{t}^{2} t^{1 / 2+2 p}=0$ and for any $\delta \in(0, p)$ define the event $A_{t, \delta}=\left\{J_{t}^{(1)} \geq t^{\delta-p}\right\}$. Then we have, for $t$ large enough such that $\alpha_{t}^{2} t^{1 / 2+2 p} \leq 1$,
$$
\mathbb{P}\left(A_{t, \delta}\right) \leq \frac{\mathbb{E}\left|J_{t}^{(1)}\right|^{2}}{t^{2(\delta-p)}} \leq K \frac{\alpha_{t}^{2} t^{1 / 2+2 p}}{t^{2 \delta}} \leq K \frac{1}{t^{2 \delta}} .
$$
The latter implies that
$$
\sum_{n \in \mathbb{N}} \mathbb{P}\left(A_{2^{n}, \delta}\right)<\infty .
$$
Therefore, by the Borel–Cantelli lemma, for every $\delta \in(0, p)$ there exist a finite positive random variable $d(\omega)$ and some $n_{0}<\infty$ such that for every $n \geq n_{0}$ one has
$$
J_{2^{n}}^{(1)} \leq \frac{d(\omega)}{2^{n(p-\delta)}} .
$$
Thus for $t \in\left[2^{n}, 2^{n+1}\right)$ and $n \geq n_{0}$, one has for some finite constant $K<\infty$
$$
J_{t}^{(1)} \leq K \alpha_{2^{n+1}} \sup _{s \in\left(0,2^{n+1}\right]}\left\|v\left(X_{s}, \theta_{s}\right)\right\| \leq K \frac{d(\omega)}{2^{(n+1)(p-\delta)}} \leq K \frac{d(\omega)}{t^{p-\delta}} .
$$
The latter display then guarantees that for $t \geq 2^{n_{0}}$ we have, with probability one,
$$
J_{t}^{(1)} \leq K \frac{d(\omega)}{t^{p-\delta}} \rightarrow 0, \quad \text{as } t \rightarrow \infty .
$$
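As a quick numerical illustration of the two facts driving this step, the snippet below checks that $\alpha_t^2 t^{1/2+2p} \to 0$ and that the Chebyshev bound $K/t^{2\delta}$ is summable along the dyadic times $t=2^n$. The concrete rate $\alpha_t = 1/(1+t)$ and the values $p=0.2$, $\delta=0.1$, $K=1$ are assumptions made only for this sketch, not choices from the paper.

```python
# Sanity checks for the learning-rate conditions used in this step.
# Assumptions (not from the paper): alpha_t = 1/(1+t), p = 0.2, delta = 0.1, K = 1.
def alpha(t):
    return 1.0 / (1.0 + t)

p, delta, K = 0.2, 0.1, 1.0

# alpha_t^2 * t^(1/2 + 2p) should tend to 0 as t grows.
vals = [alpha(t) ** 2 * t ** (0.5 + 2 * p) for t in (1e2, 1e4, 1e6)]
assert vals[0] > vals[1] > vals[2]

# The Chebyshev bound K / t^(2 delta) is summable along the dyadic times t = 2^n:
# sum_n K * 2^(-2 delta n) is a geometric series with ratio r = 2^(-2 delta) < 1.
r = 2.0 ** (-2 * delta)
partial = sum(K * r ** n for n in range(1, 200))
assert abs(partial - K * r / (1 - r)) < 1e-6
```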
Next we consider the term
$$
J_{t, 0}^{(2)}=\int_{0}^{t}\left\|\alpha_{s}^{\prime} v\left(X_{s}, \theta_{s}\right)+\alpha_{s}\left(\mathcal{L}_{\theta} v\left(X_{s}, \theta_{s}\right)+\alpha_{s} \operatorname{tr}\left[\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \nabla_{x_{i}} \nabla_{\theta} v\left(X_{s}, \theta_{s}\right)\right]_{i=1}^{m}\right)\right\| ds .
$$
By the bounds of Theorem A.1, we see that there are constants $0<K<\infty$ (that may change from line to line) and $0<q<\infty$ such that
$$
\begin{aligned}
\sup _{t>0} \mathbb{E}\left|J_{t, 0}^{(2)}\right| & \leq K \int_{0}^{\infty}\left(\left|\alpha_{s}^{\prime}\right|+\alpha_{s}^{2}\right)\left(1+\mathbb{E}\left\|X_{s}\right\|^{q}\right) ds \\
& \leq K \int_{0}^{\infty}\left(\left|\alpha_{s}^{\prime}\right|+\alpha_{s}^{2}\right) ds \\
& \leq K .
\end{aligned}
$$
The first inequality follows by Theorem A.1, the second inequality follows by Proposition 1 in [23], and the third inequality follows by Condition 2.1.
The latter display implies that there is a finite random variable $\bar{J}_{\infty, 0}^{(2)}$ such that
$$
J_{t, 0}^{(2)} \rightarrow \bar{J}_{\infty, 0}^{(2)}, \quad \text{as } t \rightarrow \infty \text{ with probability one.}
$$
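The integrability $\int_0^\infty(|\alpha_s'|+\alpha_s^2)\,ds<\infty$ used in this bound is easy to verify for a concrete rate. The sketch below is an illustration only: the choice $\alpha_s = 1/(1+s)$ is an assumption (it satisfies the Condition 2.1-type requirements), for which $|\alpha'(s)|+\alpha(s)^2 = 2/(1+s)^2$ integrates to exactly $2$; a crude trapezoid rule recovers that value.

```python
# Check int_0^inf (|alpha'(s)| + alpha(s)^2) ds < infinity for the assumed rate
# alpha(s) = 1/(1+s), where alpha'(s) = -1/(1+s)^2, so the integrand is 2/(1+s)^2.
def integrand(s):
    return 1.0 / (1.0 + s) ** 2 + (1.0 / (1.0 + s)) ** 2  # |alpha'| + alpha^2

h, upper = 0.001, 1000.0                 # trapezoid rule on [0, 1000]
n = int(upper / h)
total = 0.5 * h * (integrand(0.0) + integrand(upper))
total += h * sum(integrand(i * h) for i in range(1, n))

# the truncated integral equals 2 * (1 - 1/1001); the tail beyond 1000 is ~0.002
assert abs(total - 2.0) < 1e-2
```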
The last term that we need to consider is the martingale term
$$
J_{t, 0}^{(3)}=\int_{0}^{t} \alpha_{s}\left\langle\nabla_{x} v\left(X_{s}, \theta_{s}\right), \sigma\, dW_{s}\right\rangle+\int_{0}^{t} \alpha_{s}^{2}\left\langle\nabla_{\theta} v\left(X_{s}, \theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} dW_{s}\right\rangle .
$$
Notice that the Burkholder–Davis–Gundy inequality and the bounds of Theorem A.1 (doing calculations similar to the ones for the term $J_{t, 0}^{(2)}$) give us that for some finite constant $K<\infty$, we have
$$
\sup _{t>0} \mathbb{E}\left|J_{t, 0}^{(3)}\right|^{2} \leq K \int_{0}^{\infty} \alpha_{s}^{2} ds<\infty .
$$
Thus, by Doob's martingale convergence theorem there is a square-integrable random variable $\bar{J}_{\infty, 0}^{(3)}$ such that

$$
J_{t, 0}^{(3)} \rightarrow \bar{J}_{\infty, 0}^{(3)}, \quad \text{as } t \rightarrow \infty \text{ both almost surely and in } L^{2} .
$$
Let us now go back to (3.1). Using the terms $J_{t}^{(1)}, J_{t, 0}^{(2)}$ and $J_{t, 0}^{(3)}$ we can write
$$
\left\|\Gamma_{k, \eta}\right\| \leq J_{\sigma_{k, \eta}}^{(1)}+J_{\tau_{k}}^{(1)}+J_{\sigma_{k, \eta}, \tau_{k}}^{(2)}+\left\|J_{\sigma_{k, \eta}, \tau_{k}}^{(3)}\right\| .
$$
The last display together with (3.2), (3.3) and (3.4) implies the statement of the lemma.
Lemma 3.2. Assume that Conditions 2.1, 2.2 and 2.3 hold. Choose $\lambda>0$ such that for a given $\kappa>0$, one has $3 \lambda+\frac{\lambda}{4 \kappa}=\frac{1}{2 L_{\nabla \bar{g}}}$, where $L_{\nabla \bar{g}}$ is the Lipschitz constant of $\nabla \bar{g}$. For $k$ large enough and for $\eta>0$ small enough (potentially random, depending on $k$), one has $\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} ds>\lambda$. In addition, we also have $\frac{\lambda}{2} \leq \int_{\tau_{k}}^{\sigma_{k}} \alpha_{s} ds \leq \lambda$ with probability one.
Proof. Let us define the random variable
$$
R_{s}=\sum_{k \geq 1}\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| 1_{s \in I_{k}}+\kappa\, 1_{s \in[0, \infty) \backslash \bigcup_{k \geq 1} I_{k}} .
$$
Then, for any $s \in \mathbb{R}$ we have $\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\| / R_{s} \leq 2$.
We proceed with an argument by contradiction. In particular, let us assume that $\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} ds \leq \lambda$ and let us choose arbitrarily some $\epsilon>0$ such that $\epsilon \leq \lambda / 8$.
Let us now make some remarks that are independent of the sign of $\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} ds-\lambda$. Due to the summability condition $\int_{0}^{\infty} \alpha_{t}^{2} dt<\infty$, the bound $\frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \leq 1$ and Conditions 2.1 and 2.3, we have that
$$
\sup _{t>0} \mathbb{E}\left|\int_{0}^{t} \alpha_{s} \frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} dW_{s}\right|^{2}<\infty .
$$
Hence, the martingale convergence theorem applies to the martingale $\int_{0}^{t} \alpha_{s} \frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} dW_{s}$. This means that there exists a square-integrable random variable $M$ such that $\int_{0}^{t} \alpha_{s} \frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} dW_{s} \rightarrow M$ both almost surely and in $L^{2}$. In particular, for the given $\epsilon>0$ there is $k$ large enough such that $\left\|\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} \frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} dW_{s}\right\|<\epsilon$ almost surely.
Let us also assume that for the given $k$, $\eta$ is small enough that for any $s \in\left[\tau_{k}, \sigma_{k, \eta}\right]$ one has $\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\| \leq 3\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|$.
Then, we obtain the following
$$
\begin{aligned}
\left\|\theta_{\sigma_{k, \eta}}-\theta_{\tau_{k}}\right\| = {} & \left\|-\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} \nabla_{\theta} g\left(X_{s}, \theta_{s}\right) ds+\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} dW_{s}\right\| \\
= {} & \left\|-\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} \nabla_{\theta} \bar{g}\left(\theta_{s}\right) ds-\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left(\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)-\nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right) ds+\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} dW_{s}\right\| \\
\leq {} & \int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\| ds+\left\|\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left(\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)-\nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right) ds\right\|+\left\|\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} dW_{s}\right\| \\
\leq {} & 3\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} ds+\left\|\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left(\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)-\nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right) ds\right\| \\
& +\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|}{\kappa}\left\|\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} \frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} dW_{s}\right\| \\
\leq {} & 3\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \lambda+\left\|\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left(\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)-\nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right) ds\right\|+\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|}{\kappa} \epsilon \\
\leq {} & 3\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \lambda+\left\|\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left(\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)-\nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right) ds\right\|+\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|}{\kappa} \lambda / 8 \\
= {} & \left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|\left[3 \lambda+\frac{\lambda}{8 \kappa}\right]+\left\|\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left(\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)-\nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right) ds\right\| .
\end{aligned}
$$
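The parameter dynamics in the first line of this bound, $d\theta_t = -\alpha_t \nabla_\theta g(X_t,\theta_t)\,dt + \alpha_t \nabla_\theta f(X_t,\theta_t)\sigma^{-1}\,dW_t$, can be simulated by an Euler–Maruyama discretization. The sketch below is a toy illustration under assumed choices (a one-dimensional model $f(x,\theta)=-\theta x$ with $\theta^*=2$, $\sigma=1$, rate $\alpha_t = 1/(1+t)$), written in the equivalent data-driven form $d\theta_t = \alpha_t \nabla_\theta f(X_t,\theta_t)\,\sigma^{-2}(dX_t - f(X_t,\theta_t)\,dt)$; it is not the paper's implementation.

```python
import math
import random

# Euler-Maruyama sketch of the SGDCT parameter update (toy, one-dimensional).
# All concrete choices (f(x, theta) = -theta * x, theta* = 2, sigma = 1,
# alpha_t = 1/(1+t), T = 500) are assumptions for illustration only.
random.seed(7)
theta_star, sigma = 2.0, 1.0
dt, T = 0.01, 500.0

x, theta, t = 1.0, 0.0, 0.0
while t < T:
    dW = random.gauss(0.0, math.sqrt(dt))
    dx = -theta_star * x * dt + sigma * dW     # observed increment of the data stream
    alpha = 1.0 / (1.0 + t)                    # decaying learning rate
    grad_f = -x                                # gradient of f(x, theta) in theta
    # d(theta) = alpha * grad_f * sigma^{-2} * (dX - f(X, theta) dt)
    theta += alpha * grad_f / sigma**2 * (dx - (-theta * x) * dt)
    x += dx
    t += dt

# theta should drift toward theta* = 2 (stochastic, so only a loose check)
assert abs(theta - theta_star) < 1.5
```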
Let us next bound appropriately the Euclidean norm of the vector-valued random variable
$$
\Gamma_{k, \eta}=\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left(\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)-\nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right) ds .
$$
By Lemma 3.1, we have that for the same $0<\epsilon \leq \lambda / 8$ that was chosen before there is $k$ large enough such that, almost surely,
$$
\left\|\Gamma_{k, \eta}\right\| \leq \epsilon \leq \lambda / 8 .
$$
Hence, using also the fact that $\frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \leq 1$, we obtain
$$
\left\|\theta_{\sigma_{k, \eta}}-\theta_{\tau_{k}}\right\| \leq\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|\left[3 \lambda+\frac{\lambda}{4 \kappa}\right]=\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \frac{1}{2 L_{\nabla \bar{g}}} .
$$
The latter then implies that we should have
$$
\left\|\nabla \bar{g}\left(\theta_{\sigma_{k, \eta}}\right)-\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \leq L_{\nabla \bar{g}}\left\|\theta_{\sigma_{k, \eta}}-\theta_{\tau_{k}}\right\| \leq \frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|}{2} .
$$
The latter statement then implies that
$$
\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|}{2} \leq\left\|\nabla \bar{g}\left(\theta_{\sigma_{k, \eta}}\right)\right\| \leq 2\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| .
$$
But then we would necessarily have $\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} ds>\lambda$, since otherwise $\sigma_{k, \eta} \in\left[\tau_{k}, \sigma_{k}\right]$, which is impossible.
Next we move on to prove the second statement of the lemma. By definition we have $\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s} ds \leq \lambda$. So it remains to show that $\frac{\lambda}{2} \leq \int_{\tau_{k}}^{\sigma_{k}} \alpha_{s} ds$. Since we know that $\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} ds>\lambda$, and because for $k$ large enough and $\eta$ small enough one should have $\int_{\sigma_{k}}^{\sigma_{k, \eta}} \alpha_{s} ds \leq \lambda / 2$, we obtain that
$$
\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s} ds \geq \lambda-\int_{\sigma_{k}}^{\sigma_{k, \eta}} \alpha_{s} ds \geq \lambda-\lambda / 2=\lambda / 2,
$$
concluding the proof of the lemma.
Lemma 3.3 shows that the function $\bar{g}$ and its first two derivatives are uniformly bounded in $\theta$.
Lemma 3.3. Assume Conditions 2.1, 2.2 and 2.3. For any $q>0$, there is a constant $C<\infty$ such that
$$
\int_{\mathcal{X}}\left(1+|x|^{q}\right) \pi(dx) \leq C .
$$
In addition, we also have that there is a constant $C<\infty$ such that $\sum_{i=0}^{2}\left\|\nabla_{\theta}^{i} \bar{g}(\theta)\right\| \leq C$.
Proof. By Theorem 1 in [24], the density $\mu$ of the measure $\pi$ admits, for any $p$, a constant $C_{p}$ such that $\mu(x) \leq \frac{C_{p}}{1+|x|^{p}}$. Choosing $p$ large enough that $\int_{\mathcal{X}} \frac{1+|x|^{q}}{1+|x|^{p}} dx<\infty$, we then obtain
$$
\int_{\mathcal{X}}\left(1+|x|^{q}\right) \pi(dx) \leq \int_{\mathcal{X}} C_{p} \frac{1+|x|^{q}}{1+|x|^{p}} dx \leq C,
$$
concluding the proof of the first statement of the lemma. Let us now focus on the second part of the lemma. We only prove the claim for $i=0$, since, due to the bounds in Condition 2.3, the proof for $i=1,2$ is the same. By Condition 2.3 and by the first part of the lemma, we have that there exist constants $0<q, K, C<\infty$ such that
$$
\bar{g}(\theta)=\int_{\mathcal{X}} \frac{1}{2}\left\|f(x, \theta)-f^{*}(x)\right\|^{2} \pi(dx) \leq K \int_{\mathcal{X}}\left(1+|x|^{q}\right) \pi(dx) \leq C,
$$
concluding the proof of the lemma.
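The moment bound in the first statement of the lemma can be made concrete on a toy ergodic model. In the Monte Carlo sketch below, all choices are assumptions for illustration: the stationary law $\pi$ is taken to be the standard Gaussian, which is the invariant measure of the Ornstein–Uhlenbeck process $dX = -X\,dt + \sqrt{2}\,dW$, and $q=4$, for which $\int(1+|x|^q)\,\pi(dx) = 1+\mathbb{E}|Z|^4 = 4$.

```python
import random

# Monte Carlo illustration of int (1 + |x|^q) pi(dx) <= C for an assumed toy
# ergodic diffusion: the OU process dX = -X dt + sqrt(2) dW, whose stationary
# measure pi is N(0, 1), so every polynomial moment is finite.
random.seed(1)
q, n = 4, 200_000
# sample directly from pi = N(0, 1), the known stationary law of this toy model
estimate = sum(1.0 + abs(random.gauss(0.0, 1.0)) ** q for _ in range(n)) / n

# For N(0, 1): E|x|^4 = 3, so the integral equals 1 + 3 = 4
assert abs(estimate - 4.0) < 0.2
```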
Our next goal is to show that if the index $k$ is large enough, then $\bar{g}$ decreases, in the sense of Lemma 3.4.
Lemma 3.4. Assume Conditions 2.1, 2.2 and 2.3. Suppose that there are infinitely many intervals $I_{k}=\left[\tau_{k}, \sigma_{k}\right)$. Then there is a fixed constant $\gamma=\gamma(\kappa)>0$ such that for $k$ large enough, one has
$$
\bar{g}\left(\theta_{\sigma_{k}}\right)-\bar{g}\left(\theta_{\tau_{k}}\right) \leq-\gamma .
$$
Proof. By Itô's formula we have that
$$\begin{aligned}
\bar{g}\left(\theta_{\sigma_{k}}\right)-\bar{g}\left(\theta_{\tau_{k}}\right) & =-\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\|^{2} d s+\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\nabla \bar{g}\left(\theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle \\
& \quad+\int_{\tau_{k}}^{\sigma_{k}} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta} \nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right] d s \\
& \quad+\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\nabla_{\theta} \bar{g}\left(\theta_{s}\right), \nabla_{\theta} \bar{g}\left(\theta_{s}\right)-\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)\right\rangle d s \\
& =\Theta_{1, k}+\Theta_{2, k}+\Theta_{3, k}+\Theta_{4, k} .
\end{aligned}$$
Let us first consider $\Theta_{1, k}$. Notice that for all $s \in\left[\tau_{k}, \sigma_{k}\right]$ one has $\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|}{2} \leq\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\| \leq 2\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|$. Hence, for sufficiently large $k$, we have the upper bound
Θ 1 , k = τ k σ k α s g ¯ ( θ s ) 2 d s g ¯ ( θ τ k ) 2 4 τ k σ k α s d s g ¯ ( θ τ k ) 2 8 λ , Θ 1 , k = τ k σ k α s g ¯ θ s 2 d s g ¯ θ τ k 2 4 τ k σ k α s d s g ¯ θ τ k 2 8 λ , Theta_(1,k)=-int_(tau_(k))^(sigma_(k))alpha_(s)||grad( bar(g))(theta_(s))||^(2)ds <= -(||grad( bar(g))(theta_(tau_(k)))||^(2))/(4)int_(tau_(k))^(sigma_(k))alpha_(s)ds <= -(||grad( bar(g))(theta_(tau_(k)))||^(2))/(8)lambda,\Theta_{1, k}=-\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\|^{2} d s \leq-\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|^{2}}{4} \int_{\tau_{k}}^{\sigma_{k}} \alpha_{s} d s \leq-\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|^{2}}{8} \lambda,Θ1,k=τkσkαsg¯(θs)2dsg¯(θτk)24τkσkαsdsg¯(θτk)28λ,
since Lemma 3.1 proved that τ k σ k α s d s λ 2 τ k σ k α s d s λ 2 int_(tau_(k))^(sigma_(k))alpha_(s)ds >= (lambda)/(2)\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s} d s \geq \frac{\lambda}{2}τkσkαsdsλ2 for sufficiently large k k kkk.
We next address Θ 2 , k Θ 2 , k Theta_(2,k)\Theta_{2, k}Θ2,k and show that it becomes small as k k k rarr ook \rightarrow \inftyk. First notice that we can trivially write
Θ 2 , k = τ k σ k α s g ¯ ( θ s ) , θ f ( X s , θ s ) σ 1 d W s = g ¯ ( θ τ k ) τ k σ k α s g ¯ ( θ s ) g ¯ ( θ τ k ) , θ f ( X s , θ s ) σ 1 d W s = g ¯ ( θ τ k ) τ k σ k α s g ¯ ( θ s ) R s , θ f ( X s , θ s ) σ 1 d W s . Θ 2 , k = τ k σ k α s g ¯ θ s , θ f X s , θ s σ 1 d W s = g ¯ θ τ k τ k σ k α s g ¯ θ s g ¯ θ τ k , θ f X s , θ s σ 1 d W s = g ¯ θ τ k τ k σ k α s g ¯ θ s R s , θ f X s , θ s σ 1 d W s . {:[Theta_(2,k)=int_(tau_(k))^(sigma_(k))alpha_(s)(:grad( bar(g))(theta_(s)),grad_(theta)f(X_(s),theta_(s))sigma^(-1)dW_(s):)=||grad( bar(g))(theta_(tau_(k)))||int_(tau_(k))^(sigma_(k))alpha_(s)(:(grad( bar(g))(theta_(s)))/(||grad( bar(g))(theta_(tau_(k)))||),grad_(theta)f(X_(s),theta_(s))sigma^(-1)dW_(s):)],[=||grad( bar(g))(theta_(tau_(k)))||int_(tau_(k))^(sigma_(k))alpha_(s)(:(grad( bar(g))(theta_(s)))/(R_(s)),grad_(theta)f(X_(s),theta_(s))sigma^(-1)dW_(s):).]:}\begin{aligned} \Theta_{2, k} & =\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\nabla \bar{g}\left(\theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle=\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\frac{\nabla \bar{g}\left(\theta_{s}\right)}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|}, \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle \\ & =\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\frac{\nabla \bar{g}\left(\theta_{s}\right)}{R_{s}}, \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle . \end{aligned}Θ2,k=τkσkαsg¯(θs),θf(Xs,θs)σ1dWs=g¯(θτk)τkσkαsg¯(θs)g¯(θτk),θf(Xs,θs)σ1dWs=g¯(θτk)τkσkαsg¯(θs)Rs,θf(Xs,θs)σ1dWs.
By Condition 2.3, the bound $\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\| / R_{s} \leq 2$, and the Itô isometry, we have
sup t > 0 E | 0 t α s g ¯ ( θ s ) R s , θ f ( X s , θ s ) σ 1 d W s | 2 4 E 0 α s 2 θ f ( X s , θ s ) 2 d s K 0 α s 2 ( 1 + E X s q ) d s < sup t > 0 E 0 t α s g ¯ θ s R s , θ f X s , θ s σ 1 d W s 2 4 E 0 α s 2 θ f X s , θ s 2 d s K 0 α s 2 1 + E X s q d s < {:[s u p_(t > 0)E|int_(0)^(t)alpha_(s)(:(grad( bar(g))(theta_(s)))/(R_(s)),grad_(theta)f(X_(s),theta_(s))sigma^(-1)dW_(s):)|^(2) <= 4Eint_(0)^(oo)alpha_(s)^(2)||grad_(theta)f(X_(s),theta_(s))||^(2)ds],[ <= Kint_(0)^(oo)alpha_(s)^(2)(1+E||X_(s)||^(q))ds < oo]:}\begin{aligned} \sup _{t>0} \mathbb{E}\left|\int_{0}^{t} \alpha_{s}\left\langle\frac{\nabla \bar{g}\left(\theta_{s}\right)}{R_{s}}, \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle\right|^{2} & \leq 4 \mathbb{E} \int_{0}^{\infty} \alpha_{s}^{2}\left\|\nabla_{\theta} f\left(X_{s}, \theta_{s}\right)\right\|^{2} d s \\ & \leq K \int_{0}^{\infty} \alpha_{s}^{2}\left(1+\mathbb{E}\left\|X_{s}\right\|^{q}\right) d s<\infty \end{aligned}supt>0E|0tαsg¯(θs)Rs,θf(Xs,θs)σ1dWs|24E0αs2θf(Xs,θs)2dsK0αs2(1+EXsq)ds<
where $R_{s}$ is defined via (3.5). Hence, by Doob's martingale convergence theorem there is a square integrable random variable $M$ such that $\int_{0}^{t} \alpha_{s}\left\langle\frac{\nabla \bar{g}\left(\theta_{s}\right)}{R_{s}}, \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle \rightarrow M$ both almost surely and in $L^{2}$. In particular, the increments of this stochastic integral over $\left[\tau_{k}, \sigma_{k}\right]$ vanish as $k \rightarrow \infty$, so for a given $\epsilon>0$ there is $k$ large enough such that almost surely
Θ 2 , k g ¯ ( θ τ k ) ϵ . Θ 2 , k g ¯ θ τ k ϵ . Theta_(2,k) <= ||grad( bar(g))(theta_(tau_(k)))||epsilon.\Theta_{2, k} \leq\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \epsilon .Θ2,kg¯(θτk)ϵ.
We now consider Θ 3 , k Θ 3 , k Theta_(3,k)\Theta_{3, k}Θ3,k.
sup t > 0 E 0 t α s 2 2 tr [ ( θ f ( X s , θ s ) σ 1 ) ( θ f ( X s , θ s ) σ 1 ) θ θ g ¯ ( θ s ) ] d s C 0 α s 2 2 E ( 1 + X s q ) d s < , sup t > 0 E 0 t α s 2 2 tr θ f X s , θ s σ 1 θ f X s , θ s σ 1 θ θ g ¯ θ s d s C 0 α s 2 2 E 1 + X s q d s < , {:[s u p_(t > 0)E||int_(0)^(t)(alpha_(s)^(2))/(2)tr[(grad_(theta)f(X_(s),theta_(s))sigma^(-1))(grad_(theta)f(X_(s),theta_(s))sigma^(-1))^(TT)grad_(theta)grad_(theta)( bar(g))(theta_(s))]ds||],[ <= Cint_(0)^(oo)(alpha_(s)^(2))/(2)E(1+||X_(s)||^(q))ds < oo","]:}\begin{aligned} & \sup _{t>0} \mathbb{E}\left\|\int_{0}^{t} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta} \nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right] d s\right\| \\ \leq & C \int_{0}^{\infty} \frac{\alpha_{s}^{2}}{2} \mathbb{E}\left(1+\left\|X_{s}\right\|^{q}\right) d s<\infty, \end{aligned}supt>0E0tαs22tr[(θf(Xs,θs)σ1)(θf(Xs,θs)σ1)θθg¯(θs)]dsC0αs22E(1+Xsq)ds<,
where we have used Condition 2.3 and Lemma 3.3. Bound (3.7) implies that
0 α s 2 2 tr [ ( θ f ( X s , θ s ) σ 1 ) ( θ f ( X s , θ s ) σ 1 ) θ θ g ¯ ( θ s ) ] d s 0 α s 2 2 tr θ f X s , θ s σ 1 θ f X s , θ s σ 1 θ θ g ¯ θ s d s int_(0)^(oo)(alpha_(s)^(2))/(2)tr[(grad_(theta)f(X_(s),theta_(s))sigma^(-1))(grad_(theta)f(X_(s),theta_(s))sigma^(-1))^(TT)grad_(theta)grad_(theta)( bar(g))(theta_(s))]ds\int_{0}^{\infty} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta} \nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right] d s0αs22tr[(θf(Xs,θs)σ1)(θf(Xs,θs)σ1)θθg¯(θs)]ds
is finite almost surely, which in turn implies that there is a finite random variable Θ 3 Θ 3 Theta_(3)^(oo)\Theta_{3}^{\infty}Θ3 such that
0 t α s 2 2 tr [ ( θ f ( X s , θ s ) σ 1 ) ( θ f ( X s , θ s ) σ 1 ) θ θ g ¯ ( θ s ) ] d s Θ 3 as t 0 t α s 2 2 tr θ f X s , θ s σ 1 θ f X s , θ s σ 1 θ θ g ¯ θ s d s Θ 3  as  t int_(0)^(t)(alpha_(s)^(2))/(2)tr[(grad_(theta)f(X_(s),theta_(s))sigma^(-1))(grad_(theta)f(X_(s),theta_(s))sigma^(-1))^(TT)grad_(theta)grad_(theta)( bar(g))(theta_(s))]ds rarrTheta_(3)^(oo)" as "t rarr oo\int_{0}^{t} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta} \nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right] d s \rightarrow \Theta_{3}^{\infty} \text { as } t \rightarrow \infty0tαs22tr[(θf(Xs,θs)σ1)(θf(Xs,θs)σ1)θθg¯(θs)]dsΘ3 as t
with probability one. Since Θ 3 Θ 3 Theta_(3)^(oo)\Theta_{3}^{\infty}Θ3 is finite, τ k σ k α s 2 2 tr [ ( θ f ( X s , θ s ) σ 1 ) ( θ f ( X s , θ s ) σ 1 ) θ θ g ¯ ( θ s ) ] d s 0 τ k σ k α s 2 2 tr θ f X s , θ s σ 1 θ f X s , θ s σ 1 θ θ g ¯ θ s d s 0 int_(tau_(k))^(sigma_(k))(alpha_(s)^(2))/(2)tr[(grad_(theta)f(X_(s),theta_(s))sigma^(-1))(grad_(theta)f(X_(s),theta_(s))sigma^(-1))^(TT)grad_(theta)grad_(theta)( bar(g))(theta_(s))]ds rarr0\int_{\tau_{k}}^{\sigma_{k}} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta} \nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right] d s \rightarrow 0τkσkαs22tr[(θf(Xs,θs)σ1)(θf(Xs,θs)σ1)θθg¯(θs)]ds0 as k k k rarr ook \rightarrow \inftyk with probability one.
Finally, we address $\Theta_{4, k}$. Let us consider the function $G(x, \theta)=\left\langle\nabla_{\theta} \bar{g}(\theta), \nabla_{\theta} g(x, \theta)-\nabla_{\theta} \bar{g}(\theta)\right\rangle$. The function $G(x, \theta)$ satisfies the centering condition (A.1) of Theorem A.1. Therefore, the Poisson equation (A.2) with right-hand side $G(x, \theta)$ has a unique smooth solution, say $v(x, \theta)$, that grows at most polynomially in $x$. Let us apply Itô's formula to the function $u(t, x, \theta)=\alpha_{t} v(x, \theta)$, where $v$ is the solution of this Poisson equation:
$$\begin{aligned}
u\left(\sigma, X_{\sigma}, \theta_{\sigma}\right)-u\left(\tau, X_{\tau}, \theta_{\tau}\right) & =\int_{\tau}^{\sigma} \partial_{s} u\left(s, X_{s}, \theta_{s}\right) d s+\int_{\tau}^{\sigma} \mathcal{L}_{x} u\left(s, X_{s}, \theta_{s}\right) d s+\int_{\tau}^{\sigma} \mathcal{L}_{\theta} u\left(s, X_{s}, \theta_{s}\right) d s \\
& \quad+\int_{\tau}^{\sigma} \alpha_{s} \operatorname{tr}\left[\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \nabla_{x} \nabla_{\theta} u\left(s, X_{s}, \theta_{s}\right)\right] d s \\
& \quad+\int_{\tau}^{\sigma}\left\langle\nabla_{x} u\left(s, X_{s}, \theta_{s}\right), \sigma d W_{s}\right\rangle+\int_{\tau}^{\sigma} \alpha_{s}\left\langle\nabla_{\theta} u\left(s, X_{s}, \theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle .
\end{aligned}$$
Rearranging the previous Itô formula yields
$$\begin{aligned}
\Theta_{4, k} & =\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\nabla_{\theta} \bar{g}\left(\theta_{s}\right), \nabla_{\theta} \bar{g}\left(\theta_{s}\right)-\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)\right\rangle d s=\int_{\tau_{k}}^{\sigma_{k}} \mathcal{L}_{x} u\left(s, X_{s}, \theta_{s}\right) d s \\
& =\left[\alpha_{\sigma_{k}} v\left(X_{\sigma_{k}}, \theta_{\sigma_{k}}\right)-\alpha_{\tau_{k}} v\left(X_{\tau_{k}}, \theta_{\tau_{k}}\right)-\int_{\tau_{k}}^{\sigma_{k}} \partial_{s} \alpha_{s} v\left(X_{s}, \theta_{s}\right) d s\right] \\
& \quad-\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left[\mathcal{L}_{\theta} v\left(X_{s}, \theta_{s}\right)+\alpha_{s} \operatorname{tr}\left[\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \nabla_{x} \nabla_{\theta} v\left(X_{s}, \theta_{s}\right)\right]\right] d s \\
& \quad-\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\nabla_{x} v\left(X_{s}, \theta_{s}\right), \sigma d W_{s}\right\rangle-\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\nabla_{\theta} v\left(X_{s}, \theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle .
\end{aligned}$$
Following the exact same steps as in the proof of Lemma 3.1, we obtain that $\left\|\Theta_{4, k}\right\| \rightarrow 0$ as $k \rightarrow \infty$ almost surely.
We now return to g ¯ ( θ σ k ) g ¯ ( θ τ k ) g ¯ θ σ k g ¯ θ τ k bar(g)(theta_(sigma_(k)))- bar(g)(theta_(tau_(k)))\bar{g}\left(\theta_{\sigma_{k}}\right)-\bar{g}\left(\theta_{\tau_{k}}\right)g¯(θσk)g¯(θτk) and provide an upper bound which is negative. For sufficiently large k k kkk, we have that:
g ¯ ( θ σ k ) g ¯ ( θ τ k ) g ¯ ( θ τ k ) 2 8 λ + Θ 2 , k + Θ 3 , k + Θ 4 , k g ¯ ( θ τ k ) 2 8 λ + g ¯ ( θ τ k ) ϵ + ϵ + ϵ . g ¯ θ σ k g ¯ θ τ k g ¯ θ τ k 2 8 λ + Θ 2 , k + Θ 3 , k + Θ 4 , k g ¯ θ τ k 2 8 λ + g ¯ θ τ k ϵ + ϵ + ϵ . {:[ bar(g)(theta_(sigma_(k)))- bar(g)(theta_(tau_(k))) <= -(||grad( bar(g))(theta_(tau_(k)))||^(2))/(8)lambda+||Theta_(2,k)||+||Theta_(3,k)||+||Theta_(4,k)||],[ <= -(||grad( bar(g))(theta_(tau_(k)))||^(2))/(8)lambda+||grad( bar(g))(theta_(tau_(k)))||epsilon+epsilon+epsilon.]:}\begin{aligned} \bar{g}\left(\theta_{\sigma_{k}}\right)-\bar{g}\left(\theta_{\tau_{k}}\right) & \leq-\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|^{2}}{8} \lambda+\left\|\Theta_{2, k}\right\|+\left\|\Theta_{3, k}\right\|+\left\|\Theta_{4, k}\right\| \\ & \leq-\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|^{2}}{8} \lambda+\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \epsilon+\epsilon+\epsilon . \end{aligned}g¯(θσk)g¯(θτk)g¯(θτk)28λ+Θ2,k+Θ3,k+Θ4,kg¯(θτk)28λ+g¯(θτk)ϵ+ϵ+ϵ.
Choose $\epsilon=\min \left\{\frac{\lambda \kappa^{2}}{32}, \frac{\lambda}{32}\right\}$ and recall that $\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \geq \kappa$ by the definition of $\tau_{k}$. On the one hand, if $\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \geq 1$:
g ¯ ( θ σ k ) g ¯ ( θ τ k ) g ¯ ( θ τ k ) 2 8 λ + g ¯ ( θ τ k ) 2 ϵ + ϵ + ϵ 3 g ¯ ( θ τ k ) 2 32 λ + 2 ϵ 3 κ 2 32 λ + 2 κ 2 32 λ κ 2 32 λ . g ¯ θ σ k g ¯ θ τ k g ¯ θ τ k 2 8 λ + g ¯ θ τ k 2 ϵ + ϵ + ϵ 3 g ¯ θ τ k 2 32 λ + 2 ϵ 3 κ 2 32 λ + 2 κ 2 32 λ κ 2 32 λ . {:[ bar(g)(theta_(sigma_(k)))- bar(g)(theta_(tau_(k))) <= -(||grad( bar(g))(theta_(tau_(k)))||^(2))/(8)lambda+||grad( bar(g))(theta_(tau_(k)))||^(2)epsilon+epsilon+epsilon],[ <= -3(||grad( bar(g))(theta_(tau_(k)))||^(2))/(32)lambda+2epsilon <= -3(kappa^(2))/(32)lambda+2(kappa^(2))/(32)lambda <= -(kappa^(2))/(32)lambda.]:}\begin{aligned} \bar{g}\left(\theta_{\sigma_{k}}\right)-\bar{g}\left(\theta_{\tau_{k}}\right) & \leq-\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|^{2}}{8} \lambda+\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|^{2} \epsilon+\epsilon+\epsilon \\ & \leq-3 \frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|^{2}}{32} \lambda+2 \epsilon \leq-3 \frac{\kappa^{2}}{32} \lambda+2 \frac{\kappa^{2}}{32} \lambda \leq-\frac{\kappa^{2}}{32} \lambda . \end{aligned}g¯(θσk)g¯(θτk)g¯(θτk)28λ+g¯(θτk)2ϵ+ϵ+ϵ3g¯(θτk)232λ+2ϵ3κ232λ+2κ232λκ232λ.
On the other hand, if g ¯ ( θ τ k ) 1 g ¯ θ τ k 1 ||grad( bar(g))(theta_(tau_(k)))|| <= 1\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \leq 1g¯(θτk)1, then
g ¯ ( θ σ k ) g ¯ ( θ τ k ) g ¯ ( θ τ k ) 2 8 λ + ϵ + ϵ + ϵ 4 κ 2 32 λ + 3 ϵ 4 κ 2 32 λ + 3 κ 2 32 λ κ 2 32 λ . g ¯ θ σ k g ¯ θ τ k g ¯ θ τ k 2 8 λ + ϵ + ϵ + ϵ 4 κ 2 32 λ + 3 ϵ 4 κ 2 32 λ + 3 κ 2 32 λ κ 2 32 λ . {:[ bar(g)(theta_(sigma_(k)))- bar(g)(theta_(tau_(k))) <= -(||grad( bar(g))(theta_(tau_(k)))||^(2))/(8)lambda+epsilon+epsilon+epsilon],[ <= -(4kappa^(2))/(32)lambda+3epsilon <= -4(kappa^(2))/(32)lambda+3(kappa^(2))/(32)lambda <= -(kappa^(2))/(32)lambda.]:}\begin{aligned} \bar{g}\left(\theta_{\sigma_{k}}\right)-\bar{g}\left(\theta_{\tau_{k}}\right) & \leq-\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|^{2}}{8} \lambda+\epsilon+\epsilon+\epsilon \\ & \leq-\frac{4 \kappa^{2}}{32} \lambda+3 \epsilon \leq-4 \frac{\kappa^{2}}{32} \lambda+3 \frac{\kappa^{2}}{32} \lambda \leq-\frac{\kappa^{2}}{32} \lambda . \end{aligned}g¯(θσk)g¯(θτk)g¯(θτk)28λ+ϵ+ϵ+ϵ4κ232λ+3ϵ4κ232λ+3κ232λκ232λ.
Setting $\gamma=\frac{\kappa^{2}}{32} \lambda$ completes the proof of the lemma.
Lemma 3.5. Assume Conditions 2.1, 2.2 and 2.3. Suppose that there are an infinite number of intervals $I_{k}=\left[\tau_{k}, \sigma_{k}\right)$. Then there is a fixed constant $\gamma_{1}<\gamma$ such that for $k$ large enough,
g ¯ ( θ τ k ) g ¯ ( θ σ k 1 ) γ 1 . g ¯ θ τ k g ¯ θ σ k 1 γ 1 . bar(g)(theta_(tau_(k)))- bar(g)(theta_(sigma_(k-1))) <= gamma_(1).\bar{g}\left(\theta_{\tau_{k}}\right)-\bar{g}\left(\theta_{\sigma_{k-1}}\right) \leq \gamma_{1} .g¯(θτk)g¯(θσk1)γ1.
Proof. First, recall that $\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\| \leq \kappa$ for $t \in J_{k}=\left[\sigma_{k-1}, \tau_{k}\right]$. As before, we have:
$$\begin{aligned}
\bar{g}\left(\theta_{\tau_{k}}\right)-\bar{g}\left(\theta_{\sigma_{k-1}}\right) & =-\int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\|^{2} d s+\int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\nabla \bar{g}\left(\theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle \\
& \quad+\int_{\sigma_{k-1}}^{\tau_{k}} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta}^{2} \bar{g}\left(\theta_{s}\right)\right] d s \\
& \quad+\int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\nabla_{\theta} \bar{g}\left(\theta_{s}\right), \nabla_{\theta} \bar{g}\left(\theta_{s}\right)-\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)\right\rangle d s \\
& \leq \int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\nabla \bar{g}\left(\theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle \\
& \quad+\int_{\sigma_{k-1}}^{\tau_{k}} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta}^{2} \bar{g}\left(\theta_{s}\right)\right] d s \\
& \quad+\int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\nabla_{\theta} \bar{g}\left(\theta_{s}\right), \nabla_{\theta} \bar{g}\left(\theta_{s}\right)-\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)\right\rangle d s .
\end{aligned}\tag{3.8}$$
The right hand side (RHS) of equation (3.8) converges almost surely to 0 as $k \rightarrow \infty$ by arguments similar to those in Lemma 3.4. Indeed, the treatment of the second and third terms on the RHS of (3.8) is exactly the same as in Lemma 3.4. It remains to show that the first term on the RHS of (3.8) converges almost surely to 0 as $k \rightarrow \infty$:
σ k 1 τ k α s g ¯ ( θ s ) , θ f ( X s , θ s ) σ 1 d W s = g ¯ ( θ σ k 1 ) σ k 1 τ k α s g ¯ ( θ s ) g ¯ ( θ σ k 1 ) , θ f ( X s , θ s ) σ 1 d W s = g ¯ ( θ σ k 1 ) σ k 1 τ k α s g ¯ ( θ s ) R s , θ f ( X s , θ s ) σ 1 d W s . σ k 1 τ k α s g ¯ θ s , θ f X s , θ s σ 1 d W s = g ¯ θ σ k 1 σ k 1 τ k α s g ¯ θ s g ¯ θ σ k 1 , θ f X s , θ s σ 1 d W s = g ¯ θ σ k 1 σ k 1 τ k α s g ¯ θ s R s , θ f X s , θ s σ 1 d W s . {:[int_(sigma_(k-1))^(tau_(k))alpha_(s)(:grad( bar(g))(theta_(s)),grad_(theta)f(X_(s),theta_(s))sigma^(-1)dW_(s):)=||grad( bar(g))(theta_(sigma_(k-1)))||int_(sigma_(k-1))^(tau_(k))alpha_(s)(:(grad( bar(g))(theta_(s)))/(||grad( bar(g))(theta_(sigma_(k-1)))||),grad_(theta)f(X_(s),theta_(s))sigma^(-1)dW_(s):)],[=||grad( bar(g))(theta_(sigma_(k-1)))||int_(sigma_(k-1))^(tau_(k))alpha_(s)(:(grad( bar(g))(theta_(s)))/(R_(s)),grad_(theta)f(X_(s),theta_(s))sigma^(-1)dW_(s):).]:}\begin{aligned} \int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\nabla \bar{g}\left(\theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle & =\left\|\nabla \bar{g}\left(\theta_{\sigma_{k-1}}\right)\right\| \int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\frac{\nabla \bar{g}\left(\theta_{s}\right)}{\left\|\nabla \bar{g}\left(\theta_{\sigma_{k-1}}\right)\right\|}, \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle \\ & =\left\|\nabla \bar{g}\left(\theta_{\sigma_{k-1}}\right)\right\| \int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\frac{\nabla \bar{g}\left(\theta_{s}\right)}{R_{s}}, \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle . \end{aligned}σk1τkαsg¯(θs),θf(Xs,θs)σ1dWs=g¯(θσk1)σk1τkαsg¯(θs)g¯(θσk1),θf(Xs,θs)σ1dWs=g¯(θσk1)σk1τkαsg¯(θs)Rs,θf(Xs,θs)σ1dWs.
As shown in Lemma 3.4, $\int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\frac{\nabla \bar{g}\left(\theta_{s}\right)}{R_{s}}, \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle \rightarrow 0$ as $k \rightarrow \infty$ almost surely. Finally, note that $\left\|\nabla \bar{g}\left(\theta_{\sigma_{k-1}}\right)\right\| \leq \kappa$ (except when $\sigma_{k-1}=\tau_{k}$, in which case the interval $J_{k}$ has length 0 and hence the integral (3.9) over $J_{k}$ is 0). Then, $\int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\nabla \bar{g}\left(\theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1} d W_{s}\right\rangle \rightarrow 0$ as $k \rightarrow \infty$ almost surely.
Therefore, with probability one, g ¯ ( θ τ k ) g ¯ ( θ σ k 1 ) γ 1 < γ g ¯ θ τ k g ¯ θ σ k 1 γ 1 < γ bar(g)(theta_(tau_(k)))- bar(g)(theta_(sigma_(k-1))) <= gamma_(1) < gamma\bar{g}\left(\theta_{\tau_{k}}\right)-\bar{g}\left(\theta_{\sigma_{k-1}}\right) \leq \gamma_{1}<\gammag¯(θτk)g¯(θσk1)γ1<γ for sufficiently large k k kkk.
Proof of Theorem 2.4. Choose a $\kappa>0$. First, consider the case where there are finitely many times $\tau_{k}$. Then there is a finite $T$ such that $\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\|<\kappa$ for $t \geq T$. Now consider the other case, where there are infinitely many times $\tau_{k}$, and use Lemmas 3.4 and 3.5. With probability one,
g ¯ ( θ σ k ) g ¯ ( θ τ k ) γ = κ 2 32 λ , g ¯ ( θ τ k ) g ¯ ( θ σ k 1 ) γ 1 < γ , g ¯ θ σ k g ¯ θ τ k γ = κ 2 32 λ , g ¯ θ τ k g ¯ θ σ k 1 γ 1 < γ , {:[ bar(g)(theta_(sigma_(k)))- bar(g)(theta_(tau_(k))) <= -gamma=-(kappa^(2))/(32)lambda","],[ bar(g)(theta_(tau_(k)))- bar(g)(theta_(sigma_(k-1))) <= gamma_(1) < gamma","]:}\begin{aligned} \bar{g}\left(\theta_{\sigma_{k}}\right)-\bar{g}\left(\theta_{\tau_{k}}\right) & \leq-\gamma=-\frac{\kappa^{2}}{32} \lambda, \\ \bar{g}\left(\theta_{\tau_{k}}\right)-\bar{g}\left(\theta_{\sigma_{k-1}}\right) & \leq \gamma_{1}<\gamma, \end{aligned}g¯(θσk)g¯(θτk)γ=κ232λ,g¯(θτk)g¯(θσk1)γ1<γ,
for sufficiently large $k$. Choose $K$ such that (3.10) holds for $k \geq K$. This leads to:
g ¯ ( θ τ n + 1 ) g ¯ ( θ τ K ) = k = K n [ g ¯ ( θ σ k ) g ¯ ( θ τ k ) + g ¯ ( θ τ k + 1 ) g ¯ ( θ σ k ) ] k = K n ( γ + γ 1 ) < 0 . g ¯ θ τ n + 1 g ¯ θ τ K = k = K n g ¯ θ σ k g ¯ θ τ k + g ¯ θ τ k + 1 g ¯ θ σ k k = K n γ + γ 1 < 0 . bar(g)(theta_(tau_(n+1)))- bar(g)(theta_(tau_(K)))=sum_(k=K)^(n)[( bar(g))(theta_(sigma_(k)))-( bar(g))(theta_(tau_(k)))+( bar(g))(theta_(tau_(k+1)))-( bar(g))(theta_(sigma_(k)))] <= sum_(k=K)^(n)(-gamma+gamma_(1)) < 0.\bar{g}\left(\theta_{\tau_{n+1}}\right)-\bar{g}\left(\theta_{\tau_{K}}\right)=\sum_{k=K}^{n}\left[\bar{g}\left(\theta_{\sigma_{k}}\right)-\bar{g}\left(\theta_{\tau_{k}}\right)+\bar{g}\left(\theta_{\tau_{k+1}}\right)-\bar{g}\left(\theta_{\sigma_{k}}\right)\right] \leq \sum_{k=K}^{n}\left(-\gamma+\gamma_{1}\right)<0 .g¯(θτn+1)g¯(θτK)=k=Kn[g¯(θσk)g¯(θτk)+g¯(θτk+1)g¯(θσk)]k=Kn(γ+γ1)<0.
Let n n n rarr oon \rightarrow \inftyn and then g ¯ ( θ τ n + 1 ) g ¯ θ τ n + 1 bar(g)(theta_(tau_(n+1)))rarr-oo\bar{g}\left(\theta_{\tau_{n+1}}\right) \rightarrow-\inftyg¯(θτn+1). However, we also have that by definition g ¯ ( θ ) 0 g ¯ ( θ ) 0 bar(g)(theta) >= 0\bar{g}(\theta) \geq 0g¯(θ)0. This is a contradiction, and therefore almost surely there are a finite number of times τ k τ k tau_(k)\tau_{k}τk.
Consequently, there exists a finite time T T TTT (possibly random) such that almost surely g ¯ ( θ t ) < κ g ¯ θ t < κ ||grad( bar(g))(theta_(t))|| < kappa\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\|<\kappag¯(θt)<κ for t T t T t >= Tt \geq TtT. Since the original κ > 0 κ > 0 kappa > 0\kappa>0κ>0 was arbitrarily chosen, this shows that g ¯ ( θ t ) 0 g ¯ θ t 0 ||grad( bar(g))(theta_(t))||rarr0\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\| \rightarrow 0g¯(θt)0 as t t t rarr oot \rightarrow \inftyt almost surely.

5. Estimating the Coefficient Function of the Diffusion Term and Generalizations

We consider a diffusion $X_{t} \in \mathcal{X}=\mathbb{R}^{m}$:

$$d X_{t}=f^{*}\left(X_{t}\right) d t+\sigma^{*}\left(X_{t}\right) d W_{t} .$$

The goal is to statistically estimate a model $f(x, \theta)$ for $f^{*}(x)$ as well as a model $\sigma(x, \nu) \sigma^{\top}(x, \nu)$ for the diffusion coefficient $\sigma^{*}(x) \sigma^{*, \top}(x)$, where $\theta \in \mathbb{R}^{n}$ and $\nu \in \mathbb{R}^{k}$. $W_{t} \in \mathbb{R}^{m}$ is a standard Brownian motion and $\sigma^{*}(\cdot) \in \mathbb{R}^{m \times m}$. The functions $f(x, \theta)$, $\sigma(x, \nu)$, $f^{*}(x)$, and $\sigma^{*}(x)$ may be non-convex.
The stochastic gradient descent update in continuous time follows the stochastic differential equations:

$$\begin{aligned} d \theta_{t} & =\alpha_{t} \nabla_{\theta} f\left(X_{t}, \theta_{t}\right)\left[d X_{t}-f\left(X_{t}, \theta_{t}\right) d t\right], \\ d \nu_{t} & =\alpha_{t} \sum_{i, j}^{m} \nabla_{\nu}\left(\left(\sigma\left(X_{t}, \nu_{t}\right) \sigma^{\top}\left(X_{t}, \nu_{t}\right)\right)_{i, j}\right)\left[d\left\langle X_{t}, X_{t}\right\rangle_{i, j}-\left(\sigma\left(X_{t}, \nu_{t}\right) \sigma^{\top}\left(X_{t}, \nu_{t}\right)\right)_{i, j} d t\right], \end{aligned}$$

where $\left\langle X_{t}, X_{t}\right\rangle \in \mathbb{R}^{m \times m}$ is the quadratic variation matrix of $X$. Since we observe the path of $X_{t}$, we also observe the path of the quadratic variation $\left\langle X_{t}, X_{t}\right\rangle$.
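In practice the updates above are applied to discretized data. The sketch below is our own illustration (function names and array shapes are assumptions, not from the paper): $dX_t$ is approximated by the observed increment, and the quadratic-variation increment $d\langle X, X\rangle_t$ by the outer product of increments.

```python
import numpy as np

def sgdct_step(theta, nu, x, x_next, dt, alpha,
               f, grad_theta_f, Sigma, grad_nu_Sigma):
    """One Euler-discretized SGDCT step (hypothetical helper names).

    f(x, theta) -> (m,)                 drift model
    grad_theta_f(x, theta) -> (n, m)    gradient of the drift model
    Sigma(x, nu) -> (m, m)              model for sigma sigma^T
    grad_nu_Sigma(x, nu) -> (k, m, m)   gradient of the diffusion model
    """
    dx = x_next - x                        # approximates dX_t
    # Drift update: d theta = alpha * grad_theta f * (dX - f dt)
    theta = theta + alpha * grad_theta_f(x, theta) @ (dx - f(x, theta) * dt)
    # Diffusion update driven by the quadratic-variation increment
    dqv = np.outer(dx, dx)                 # approximates d<X, X>_t
    resid = dqv - Sigma(x, nu) * dt        # elementwise residual matrix
    # Sum over i, j of grad_nu Sigma_{ij} * residual_{ij}
    nu = nu + alpha * np.tensordot(grad_nu_Sigma(x, nu), resid,
                                   axes=([1, 2], [0, 1]))
    return theta, nu
```

The same step is reused below for the drift-only examples, where only the $\theta$ update is active.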
Let us set:

$$\begin{aligned} g(x, \theta) & =\frac{1}{2}\left\|f(x, \theta)-f^{*}(x)\right\|^{2}, \\ w(x, \nu) & =\frac{1}{2}\left\|\sigma(x, \nu) \sigma^{\top}(x, \nu)-\sigma^{*}(x) \sigma^{*, \top}(x)\right\|^{2} . \end{aligned}$$
We assume that $\sigma^{*}(x)$ is such that the process $X_{t}$ is ergodic with a unique invariant measure (for example, one may assume that it is non-degenerate, i.e., bounded away from zero and bounded from above). In addition, we assume that $w(x, \nu)$ satisfies the same assumptions as $g(x, \theta)$ does in Condition 2.3.
From the previous results in Section 3, $\lim _{t \rightarrow \infty}\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\|=0$ with probability one. Let us study the convergence of the stochastic gradient descent update above for $\nu_{t}$. By Itô's formula,
$$\bar{w}\left(\nu_{\sigma}\right)-\bar{w}\left(\nu_{\tau}\right)=-\int_{\tau}^{\sigma} \alpha_{s}\left\|\nabla \bar{w}\left(\nu_{s}\right)\right\|^{2} d s+\int_{\tau}^{\sigma} \alpha_{s}\left\langle\nabla_{\nu} \bar{w}\left(\nu_{s}\right), \nabla_{\nu} \bar{w}\left(\nu_{s}\right)-\nabla_{\nu} w\left(X_{s}, \nu_{s}\right)\right\rangle d s .$$
Applying exactly the same procedure as in Section 3, $\lim _{t \rightarrow \infty}\left\|\nabla \bar{w}\left(\nu_{t}\right)\right\|=0$ with probability one. We omit the details, as the proof is exactly the same as in Section 3.
Notice also that $\sigma^{*}(x)$ is not identifiable; for example, $X_{t}$ has the same distribution under the diffusion coefficient $-\sigma^{*}(x)$. Only $\sigma^{*}(x) \sigma^{*, \top}(x)$ is identifiable. We are therefore essentially estimating a model $\sigma(x, \nu) \sigma^{\top}(x, \nu)$ for $\sigma^{*}(x) \sigma^{*, \top}(x)$.
We close this section with the following remark.
Remark 5.1. The proof of Theorem 2.4 makes it clear that if appropriate assumptions on $\nabla_{\theta} f$ and $\nabla_{\theta} g$ are made such that $\sup _{t>0} \mathbb{E}\left|\theta_{t}\right|^{q}<C$ for appropriate $0<q, C<\infty$, then one can relax Condition 2.3 on $\nabla_{\theta} g$ to allow at least linear growth with respect to $\theta$.

6. Model Estimation: Numerical Analysis

We implement SGDCT for several applications and numerically analyze the convergence. Section 6.1 studies continuous-time stochastic gradient descent for the Ornstein-Uhlenbeck process, which is widely used in finance, physics, and biology. Section 6.2 studies the multidimensional Ornstein-Uhlenbeck process. Section 6.3 estimates the diffusion coefficient in Burger's equation, a widely-used nonlinear partial differential equation that is important in fluid mechanics, acoustics, aerodynamics, and engineering more broadly. In Section 6.4, we show how SGDCT can be used for reinforcement learning. In the final example, the drift and volatility functions for the multidimensional CIR process are estimated. The CIR process is widely used in financial modeling.

6.1. Ornstein-Uhlenbeck process

The Ornstein-Uhlenbeck (OU) process $X_{t} \in \mathbb{R}$ satisfies the stochastic differential equation:

$$d X_{t}=c\left(m-X_{t}\right) d t+d W_{t} .$$

We use continuous-time stochastic gradient descent to learn the parameters $\theta=(c, m) \in \mathbb{R}^{2}$.
For the numerical experiments, we use an Euler scheme with a time step of $10^{-2}$. The learning rate is $\alpha_{t}=\min (\alpha, \alpha / t)$ with $\alpha=10^{-2}$. We simulate data from the OU equation above for a particular $\theta^{*}$, and stochastic gradient descent attempts to learn a parameter $\theta_{t}$ which fits the data well. $\theta_{t}$ is the statistical estimate for $\theta^{*}$ at time $t$; if the estimation is accurate, $\theta_{t}$ should of course be close to $\theta^{*}$. This example can be placed in the form of the original class of equations (1.1) by setting $f(x, \theta)=c(m-x)$ and $f^{*}(x)=f\left(x, \theta^{*}\right)$.
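A single case of this experiment can be sketched as follows. This is our own minimal sketch under stated assumptions: the initialization `theta0` and the starting point of the path are our choices, not from the paper; for $f(x,\theta)=c(m-x)$ the gradients are $\partial f/\partial c = m-x$ and $\partial f/\partial m = c$.

```python
import numpy as np

def run_ou_sgdct(c_star, m_star, T=1e3, dt=1e-2, alpha=1e-2,
                 theta0=(0.5, 0.5), seed=0):
    """Simulate dX = c*(m* - X) dt + dW with an Euler scheme while
    running the SGDCT update for theta = (c, m) online."""
    rng = np.random.default_rng(seed)
    x = m_star                      # start the path at the long-run mean (assumption)
    c, m = theta0
    n = int(T / dt)
    for k in range(1, n + 1):
        lr = min(alpha, alpha / (k * dt))      # alpha_t = min(alpha, alpha / t)
        dx = c_star * (m_star - x) * dt + np.sqrt(dt) * rng.standard_normal()
        resid = dx - c * (m - x) * dt          # dX_t - f(X_t, theta_t) dt
        dc = (m - x) * resid                   # grad_c f = (m - x)
        dm = c * resid                         # grad_m f = c
        c, m = c + lr * dc, m + lr * dm
        x += dx
    return c, m
```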
We study 10,500 cases. For each case, a different $\theta^{*}$ is generated uniformly at random in the range $[1,2] \times[1,2]$. For each case, we solve for the parameter $\theta_{t}$ over the time period $[0, T]$ for $T=10^{6}$. To summarize:

- For cases $n=1$ to $10{,}500$:
  - Generate a random $\theta^{*}$ in $[1,2] \times[1,2]$.
  - Simulate a single path of $X_{t}$ given $\theta^{*}$ and simultaneously solve for the path of $\theta_{t}$ on $[0, T]$.
The accuracy of $\theta_{t}$ at times $t=10^{2}, 10^{3}, 10^{4}, 10^{5}$, and $10^{6}$ is reported in Table 1. Figures 1 and 2 plot the mean error in percent and mean squared error (MSE) against time. In the table and figures, the "error" is $\left|\theta_{t}^{n}-\theta^{*, n}\right|$ where $n$ represents the $n$-th case. The "error in percent" is $100 \times\left|\theta_{t}^{n}-\theta^{*, n}\right| /\left|\theta^{*, n}\right|$. The "mean error in percent" is the average of these errors, i.e. $\frac{100}{N} \sum_{n=1}^{N} \frac{\left|\theta_{t}^{n}-\theta^{*, n}\right|}{\left|\theta^{*, n}\right|}$.
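The error metrics above are straightforward to aggregate; the helper below is our own illustration (the function name and the dictionary keys are assumptions):

```python
import numpy as np

def error_summary(theta_t, theta_star):
    """Error metrics as in Table 1: theta_t and theta_star are arrays of
    shape (N, p) holding the estimates and the truth for N cases."""
    err = np.abs(theta_t - theta_star)        # |theta_t^n - theta^{*,n}|
    pct = 100.0 * err / np.abs(theta_star)    # error in percent
    return {
        "mean squared error": float(np.mean(err ** 2)),
        "mean error in percent": float(np.mean(pct)),
        "99% quantile of error": float(np.quantile(err, 0.99)),
        "max error": float(np.max(err)),
    }
```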
Figure 1: Mean error in percent plotted against time. Time is in log scale.
Figure 2: Mean squared error plotted against time. Time is in log scale.
| Error/Time | $10^{2}$ | $10^{3}$ | $10^{4}$ | $10^{5}$ | $10^{6}$ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Maximum error | .604 | .2615 | .0936 | .0349 | .0105 |
| 99% quantile of error | .368 | .140 | .0480 | .0163 | .00542 |
| 99.9% quantile of error | .470 | .1874 | .0670 | .0225 | .00772 |
| Mean squared error | $1.92 \times 10^{-2}$ | $2.28 \times 10^{-3}$ | $2.52 \times 10^{-4}$ | $2.76 \times 10^{-5}$ | $2.90 \times 10^{-6}$ |
| Mean error in percent | 7.37 | 2.497 | 0.811 | 0.264 | 0.085 |
| Maximum error in percent | 59.92 | 20.37 | 5.367 | 1.79 | 0.567 |
| 99% quantile of error in percent | 25.14 | 9.07 | 3.05 | 1.00 | 0.323 |
| 99.9% quantile of error in percent | 34.86 | 12.38 | 4.12 | 1.30 | 0.432 |

Table 1: Error at different times for the estimate $\theta_{t}$ of $\theta^{*}$ across 10,500 cases. The "error" is $\left|\theta_{t}^{n}-\theta^{*, n}\right|$ where $n$ represents the $n$-th case. The "error in percent" is $100 \times\left|\theta_{t}^{n}-\theta^{*, n}\right| /\left|\theta^{*, n}\right|$.
Finally, we also track the objective function $\bar{g}\left(\theta_{t}\right)$ over time. Figure 3 plots the error $\bar{g}\left(\theta_{t}\right)$ against time. Since the limiting distribution $\pi(x)$ of the OU process is Gaussian with mean $m^{*}$ and variance $\frac{1}{2 c^{*}}$, we have that:

$$\begin{aligned} \bar{g}(\theta) & =\int\left(c^{*}\left(m^{*}-x\right)-c(m-x)\right)^{2} \pi(x) d x \\ & =\left(c^{*} m^{*}-c m\right)^{2}+\left(c^{*}-c\right)^{2}\left(\frac{1}{2 c^{*}}+\left(m^{*}\right)^{2}\right)+2\left(c^{*} m^{*}-c m\right)\left(c-c^{*}\right) m^{*} . \end{aligned}$$
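The closed-form expression can be sanity-checked directly (the helper name is ours): it must vanish at $\theta=\theta^{*}$ and must agree with a term-by-term evaluation of the integral using $\mathbb{E}[x]=m^{*}$ and $\mathbb{E}[x^{2}]=\frac{1}{2c^{*}}+(m^{*})^{2}$.

```python
def g_bar(c, m, c_star, m_star):
    """Closed-form limiting objective for the OU process, using the
    stationary Gaussian law with mean m* and variance 1/(2 c*)."""
    a = c_star * m_star - c * m
    return (a ** 2
            + (c_star - c) ** 2 * (1.0 / (2.0 * c_star) + m_star ** 2)
            + 2.0 * a * (c - c_star) * m_star)
```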
Figure 3: The error $\bar{g}\left(\theta_{t}\right)$ plotted against time. The mean error and the quantiles of the error are calculated from the 10,500 cases. Time is in log scale.

6.2. Multidimensional Ornstein-Uhlenbeck process

The multidimensional Ornstein-Uhlenbeck process $X_{t} \in \mathbb{R}^{d}$ satisfies the stochastic differential equation:

$$d X_{t}=\left(M-A X_{t}\right) d t+d W_{t} .$$

We use continuous-time stochastic gradient descent to learn the parameters $\theta=(M, A) \in \mathbb{R}^{d} \times \mathbb{R}^{d \times d}$. For the numerical experiments, we use an Euler scheme with a time step of $10^{-2}$. The learning rate is $\alpha_{t}=\min (\alpha, \alpha / t)$ with $\alpha=10^{-1}$. We simulate data from the equation above for a particular $\theta^{*}=\left(M^{*}, A^{*}\right)$, and stochastic gradient descent attempts to learn a parameter $\theta_{t}$ which fits the data well. $\theta_{t}$ is the statistical estimate for $\theta^{*}$ at time $t$; if the estimation is accurate, $\theta_{t}$ should of course be close to $\theta^{*}$. This example can be placed in the form of the original class of equations (1.1) by setting $f(x, \theta)=M-A x$ and $f^{*}(x)=f\left(x, \theta^{*}\right)$.
The matrix $A^{*}$ must be generated carefully to ensure that $X_{t}$ is ergodic and has a stable equilibrium point. If some of $A^{*}$'s eigenvalues have negative real parts, then $X_{t}$ can become unstable and grow arbitrarily large. Therefore, we randomly generate matrices $A^{*}$ which are strictly diagonally dominant. $A^{*}$'s eigenvalues are then guaranteed to have positive real parts, and $X_{t}$ will be ergodic. To generate random strictly diagonally dominant matrices $A^{*}$, we first generate $A_{i, j}^{*}$ uniformly at random in the range $[1,2]$ for $i \neq j$. Then, we set $A_{i, i}^{*}=\sum_{j \neq i} A_{i, j}^{*}+U_{i, i}$, where $U_{i, i}$ is generated randomly in $[1,2]$. $M_{i}^{*}$ for $i=1, \ldots, d$ is also generated randomly in $[1,2]$.
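The generation procedure above can be sketched as follows (a minimal sketch; the function name is ours). By the Gershgorin circle theorem, strict diagonal dominance with positive diagonal entries forces every eigenvalue to have positive real part.

```python
import numpy as np

def random_dominant_matrix(d, rng):
    """Generate A* as in the text: off-diagonals uniform in [1, 2], each
    diagonal entry equal to the off-diagonal row sum plus U_{i,i} ~ U[1, 2],
    so A* is strictly diagonally dominant with positive diagonal."""
    A = rng.uniform(1.0, 2.0, size=(d, d))
    np.fill_diagonal(A, 0.0)
    A[np.diag_indices(d)] = A.sum(axis=1) + rng.uniform(1.0, 2.0, size=d)
    return A
```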
We study 525 cases and analyze the error in Table 2. Figures 4 and 5 plot the error over time.
| Error/Time | $10^{2}$ | $10^{3}$ | $10^{4}$ | $10^{5}$ | $10^{6}$ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Maximum error | 2.89 | .559 | .151 | .043 | .013 |
| 99% quantile of error | 2.19 | .370 | .0957 | .0294 | .00911 |
| 99.9% quantile of error | 2.57 | .481 | .118 | .0377 | .0117 |
| Mean squared error | $8.05 \times 10^{-1}$ | $2.09 \times 10^{-2}$ | $1.38 \times 10^{-3}$ | $1.29 \times 10^{-4}$ | $1.25 \times 10^{-5}$ |
| Mean error in percent | 34.26 | 6.18 | 1.68 | 0.52 | 0.161 |
| Maximum error in percent | 186.3 | 41.68 | 10.98 | 3.81 | 1.03 |
| 99% quantile of error in percent | 109.2 | 23.9 | 6.98 | 2.15 | 0.657 |
| 99.9% quantile of error in percent | 141.2 | 31.24 | 8.64 | 2.84 | 0.879 |

Table 2: Error at different times for the estimate $\theta_{t}$ of $\theta^{*}$ across 525 cases. The "error" is $\left|\theta_{t}^{n}-\theta^{*, n}\right|$ where $n$ represents the $n$-th case. The "error in percent" is $100 \times\left|\theta_{t}^{n}-\theta^{*, n}\right| /\left|\theta^{*, n}\right|$.
Figure 4: Mean error in percent plotted against time. Time is in log scale.
Figure 5: Mean squared error plotted against time. Time is in log scale.

6.3. Burger's Equation

The stochastic Burger's equation that we consider is given by:

$$\frac{\partial u}{\partial t}(t, x)=\theta \frac{\partial^{2} u}{\partial x^{2}}(t, x)-u(t, x) \frac{\partial u}{\partial x}(t, x)+\sigma \frac{\partial^{2} W(t, x)}{\partial t \partial x},$$

where $x \in[0,1]$ and $W(t, x)$ is a Brownian sheet. The finite-difference discretization of this equation satisfies a system of nonlinear stochastic differential equations (for instance, see [13] or [3]). We use continuous-time stochastic gradient descent to learn the diffusion parameter $\theta$.
We use the following finite difference scheme for Burger's equation:

$$d u\left(t, x_{i}\right)=\theta \frac{u\left(t, x_{i+1}\right)-2 u\left(t, x_{i}\right)+u\left(t, x_{i-1}\right)}{\Delta x^{2}} d t-u\left(t, x_{i}\right) \frac{u\left(t, x_{i+1}\right)-u\left(t, x_{i-1}\right)}{2 \Delta x} d t+\frac{\sigma}{\sqrt{\Delta x}} d W_{t}^{i} .$$
For our numerical experiment, the boundary conditions $u(t, x=0)=0$ and $u(t, x=1)=1$ are used, and $\sigma=0.1$. The scheme above is simulated with the Euler method (i.e., we solve Burger's equation with explicit finite differences). A spatial discretization of $\Delta x=.01$ and a time step of $10^{-5}$ are used; the small time step is needed to avoid instability in the explicit finite difference scheme. The learning rate is $\alpha_{t}=\min (\alpha, \alpha / t)$ with $\alpha=10^{-3}$. We simulate data for a particular diffusion coefficient $\theta^{*}$, and stochastic gradient descent attempts to learn a diffusion parameter $\theta_{t}$ which fits the data well. $\theta_{t}$ is the statistical estimate for $\theta^{*}$ at time $t$; if the estimation is accurate, $\theta_{t}$ should of course be close to $\theta^{*}$.
This example can be placed in the form of the original class of equations (1.1). Let $f_{i}$ be the $i$-th element of the function $f$. Then, $f_{i}(u, \theta)=\theta \frac{u\left(t, x_{i+1}\right)-2 u\left(t, x_{i}\right)+u\left(t, x_{i-1}\right)}{\Delta x^{2}}-u\left(t, x_{i}\right) \frac{u\left(t, x_{i+1}\right)-u\left(t, x_{i-1}\right)}{2 \Delta x}$. Similarly, let $f_{i}^{*}$ be the $i$-th element of the function $f^{*}$. Then, $f_{i}^{*}(u)=f_{i}\left(u, \theta^{*}\right)$.
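A sketch of one possible implementation of this experiment, under stated assumptions: the initial condition $u(0,x)=x$ (which matches the boundary conditions), the initial guess `theta0`, and the horizon `T` are our choices, not from the paper. Since $\nabla_{\theta} f_{i}$ is the discrete Laplacian, the SGDCT update for the scalar $\theta$ is a dot product of the Laplacian with the residual increment.

```python
import numpy as np

def burgers_sgdct(theta_star, sigma=0.1, dx=0.01, dt=1e-5, T=1.0,
                  alpha=1e-3, theta0=1.0, seed=0):
    """Explicit finite-difference simulation of the stochastic Burger's
    equation together with the SGDCT update for the diffusion parameter."""
    rng = np.random.default_rng(seed)
    n = int(round(1.0 / dx)) + 1        # grid on [0, 1]
    u = np.linspace(0.0, 1.0, n)        # initial condition u(0, x) = x (assumption)
    theta = theta0
    for k in range(1, int(T / dt) + 1):
        lap = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx ** 2    # discrete u_xx
        adv = u[1:-1] * (u[2:] - u[:-2]) / (2 * dx)       # u u_x
        noise = sigma / np.sqrt(dx) * np.sqrt(dt) * rng.standard_normal(n - 2)
        du = (theta_star * lap - adv) * dt + noise        # observed increment
        # SGDCT: grad_theta f_i = lap_i; residual = du_i - f_i(u, theta) dt
        lr = min(alpha, alpha / (k * dt))
        theta += lr * np.dot(lap, du - (theta * lap - adv) * dt)
        u[1:-1] += du                   # interior update; boundary values stay fixed
    return theta
```

Note the explicit scheme is only stable for $\theta\, dt/\Delta x^{2} \leq 1/2$, which is why the paper's small time step matters.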
We study 525 cases. For each case, a different $\theta^{*}$ is generated uniformly at random in the range $[.1,10]$. This represents a wide range of physical cases of interest, with $\theta^{*}$ ranging over two orders of magnitude. For each case, we solve for the parameter $\theta_{t}$ over the time period $[0, T]$ for $T=100$.
The accuracy of $\theta_{t}$ at times $t=10^{-1}, 10^{0}, 10^{1}$, and $10^{2}$ is reported in Table 3. Figures 6 and 7 plot the mean error in percent and mean squared error against time. The convergence of $\theta_{t}$ to $\theta^{*}$ is fairly rapid in time.
| Error/Time | $10^{-1}$ | $10^{0}$ | $10^{1}$ | $10^{2}$ |
| :--- | :---: | :---: | :---: | :---: |
| Maximum error | .1047 | .106 | .033 | .0107 |
| 99% quantile of error | .08 | .078 | .0255 | .00835 |
| Mean squared error | $1.00 \times 10^{-3}$ | $9.25 \times 10^{-4}$ | $1.02 \times 10^{-4}$ | $1.12 \times 10^{-5}$ |
| Mean error in percent | 1.26 | 1.17 | 0.4 | 0.13 |
| Maximum error in percent | 37.1 | 37.5 | 9.82 | 4.73 |
| 99% quantile of error in percent | 12.6 | 18.0 | 5.64 | 1.38 |

Table 3: Error at different times for the estimate $\theta_{t}$ of $\theta^{*}$ across 525 cases. The "error" is $\left|\theta_{t}^{n}-\theta^{*, n}\right|$ where $n$ represents the $n$-th case. The "error in percent" is $100 \times\left|\theta_{t}^{n}-\theta^{*, n}\right| /\left|\theta^{*, n}\right|$.
Figure 6: Mean error in percent plotted against time. Time is in log scale.

6.4. Reinforcement Learning

We consider the classic reinforcement learning problem of balancing a pole on a moving cart (see [6]). The goal is to balance a pole on a cart and to keep the cart from moving outside the boundaries by applying a force of $\pm 10$ Newtons.
The position $x$ of the cart, the velocity $\dot{x}$ of the cart, the angle $\beta$ of the pole, and the angular velocity $\dot{\beta}$ of the pole are observed. The dynamics of $s=(x, \dot{x}, \beta, \dot{\beta})$ satisfy a set of ODEs (see [6]):
$$\begin{aligned} \ddot{\beta}_{t} & =\frac{g \sin \beta_{t}+\cos \beta_{t}\left[\frac{-F_{t}-m l \dot{\beta}_{t}^{2} \sin \beta_{t}+\mu_{c} \operatorname{sgn}\left(\dot{x}_{t}\right)}{m_{c}+m}\right]-\frac{\mu_{p} \dot{\beta}_{t}}{m l}}{l\left[\frac{4}{3}-\frac{m \cos ^{2} \beta_{t}}{m_{c}+m}\right]}, \\ \ddot{x}_{t} & =\frac{F_{t}+m l\left[\dot{\beta}_{t}^{2} \sin \beta_{t}-\ddot{\beta}_{t} \cos \beta_{t}\right]-\mu_{c} \operatorname{sgn}\left(\dot{x}_{t}\right)}{m_{c}+m}, \end{aligned}$$
where $g$ is the acceleration due to gravity, $m_{c}$ is the mass of the cart, $m$ is the mass of the pole, $2 l$ is the length of the pole, $\mu_{c}$ is the coefficient of friction of the cart on the ground, $\mu_{p}$ is the coefficient of friction of the pole on the cart, and $F_{t} \in\{-10,10\}$ is the force applied to the cart.
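The dynamics above can be stepped forward with a simple Euler integrator. The sketch below uses the standard physical constants from the cart-pole literature ([6]); these specific values are assumptions here, not taken from this paper.

```python
import numpy as np

def cartpole_step(s, F, dt=1e-3, g=9.8, mc=1.0, m=0.1, l=0.5,
                  mu_c=5e-4, mu_p=2e-6):
    """One Euler step of the cart-pole ODEs for the state
    s = (x, x_dot, beta, beta_dot) under applied force F."""
    x, x_dot, b, b_dot = s
    sin_b, cos_b = np.sin(b), np.cos(b)
    num = (g * sin_b
           + cos_b * (-F - m * l * b_dot ** 2 * sin_b
                      + mu_c * np.sign(x_dot)) / (mc + m)
           - mu_p * b_dot / (m * l))
    b_ddot = num / (l * (4.0 / 3.0 - m * cos_b ** 2 / (mc + m)))
    x_ddot = (F + m * l * (b_dot ** 2 * sin_b - b_ddot * cos_b)
              - mu_c * np.sign(x_dot)) / (mc + m)
    return np.array([x + x_dot * dt, x_dot + x_ddot * dt,
                     b + b_dot * dt, b_dot + b_ddot * dt])
```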
For this example, $f^{*}(s)=(\dot{x}, \ddot{x}, \dot{\beta}, \ddot{\beta})$. The model is $f(s, \theta)=\left(f_{1}(s, \theta), f_{2}(s, \theta), f_{3}(s, \theta), f_{4}(s, \theta)\right)$, where each $f_{i}(s, \theta)$ is a single-layer neural network with rectified linear units.
$$f_{i}(s, \theta)=W^{2, i} h\left(W^{1, i} s+b^{1, i}\right)+b^{2, i},$$
where $\theta=\left\{W^{2, i}, W^{1, i}, b^{1, i}, b^{2, i}\right\}_{i=1}^{4}$ and $h(z)=\left(\sigma\left(z_{1}\right), \ldots, \sigma\left(z_{d}\right)\right)$ for $z \in \mathbb{R}^{d}$. The function $\sigma: \mathbb{R} \rightarrow \mathbb{R}$ is a rectified linear unit (ReLU): $\sigma(v)=\max (v, 0)$. We learn the parameter $\theta$ using continuous-time stochastic gradient descent.

Figure 7: Mean squared error plotted against time. Time is in log scale.
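A minimal sketch of one such single-hidden-layer ReLU component $f_i(s,\theta)$; the layer width and random initialization are illustrative assumptions:

```python
import random

def relu(v):
    return max(v, 0.0)

def init_params(d_in, d_hidden, scale=0.1):
    """Illustrative Gaussian initialization (an assumption, not the paper's)."""
    rnd = random.Random(0)
    W1 = [[scale * rnd.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
    b1 = [0.0] * d_hidden
    W2 = [scale * rnd.gauss(0, 1) for _ in range(d_hidden)]
    b2 = 0.0
    return W1, b1, W2, b2

def f_i(s, params):
    """f_i(s) = W2 . h(W1 s + b1) + b2, with h the elementwise ReLU."""
    W1, b1, W2, b2 = params
    hidden = [relu(sum(w * x for w, x in zip(row, s)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * z for w, z in zip(W2, hidden)) + b2
```

With zero biases and a zero input the output is exactly the output bias, which is a quick sanity check on the wiring.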
The boundary is $x=\pm 2.4$ meters and the pole must not be allowed to fall below $\beta=\frac{24}{360} \pi$ radians (the frame of reference is chosen such that perfectly upright is $0$ radians). A reward of $+1$ is received every $0.02$ seconds if $|x| \leq 2.4$ and $|\beta| \leq \frac{24}{360} \pi$. A reward of $-100$ is received (and the episode ends) if the cart moves beyond $x=\pm 2.4$ or the pole falls below $\beta=\frac{24}{360} \pi$ radians. The sum of these rewards across the entire episode is the reward for that episode. The initial state $(x, \dot{x}, \beta, \dot{\beta})$ at the start of an episode is generated uniformly at random in $[-.05, .05]^{4}$. For our numerical experiment, we assume that the rule for receiving the rewards and the distribution of the initial state are both known. An action of $\pm 10$ Newtons may be chosen every $0.02$ seconds. This force is then applied for the duration of the next $0.02$ seconds. The system (5.5) is simulated using an Euler scheme with a time step size of $10^{-3}$ seconds.
The goal, of course, is to statistically learn the optimal actions in order to achieve the highest possible reward. This requires both: 1) statistically learning the physical dynamics of $(x, \dot{x}, \beta, \dot{\beta})$ and 2) finding the optimal actions given these dynamics. The dynamics $(x, \dot{x}, \beta, \dot{\beta})$ satisfy the set of ODEs (5.5); these dynamics can be learned using continuous-time stochastic gradient descent, with a neural network for $f$. Given the estimated dynamics $f$, we use a policy gradient method to estimate the optimal actions. The approach is summarized below.
- For episodes $0, 1, 2, \ldots$:
    - For time $\left[0, T_{\text{end of episode}}\right]$:
        - Update the model $f(s, \theta)$ for the dynamics using continuous-time stochastic gradient descent.
        - Periodically update the optimal policy $\mu\left(s, a, \theta^{\mu}\right)$ using a policy gradient method. The optimal policy is learned using data simulated from the model $f(s, \theta)$. Actions are randomly selected via the policy $\mu$.
The policy $\mu$ is a neural network with parameters $\theta^{\mu}$. We use a single hidden layer with rectified linear units followed by a softmax layer for $\mu\left(s, a, \theta^{\mu}\right)$ and train it using policy gradients. The policy $\mu\left(s, a, \theta^{\mu}\right)$ gives the probability of taking action $a$ conditional on being in the state $s$:
P [ F t = 10 s t = s ] = μ ( s t , 10 , θ μ ) = σ 0 ( W 2 h ( W 1 s + b 1 ) + b 2 ) , P F t = 10 s t = s = μ s t , 10 , θ μ = σ 0 W 2 h W 1 s + b 1 + b 2 , P[F_(t)=10∣s_(t)=s]=mu(s_(t),10,theta^(mu))=sigma_(0)(W^(2)h(W^(1)s+b^(1))+b^(2)),\mathbb{P}\left[F_{t}=10 \mid s_{t}=s\right]=\mu\left(s_{t}, 10, \theta^{\mu}\right)=\sigma_{0}\left(W^{2} h\left(W^{1} s+b^{1}\right)+b^{2}\right),P[Ft=10st=s]=μ(st,10,θμ)=σ0(W2h(W1s+b1)+b2),
where σ 0 ( v ) = e v 1 + e v σ 0 ( v ) = e v 1 + e v sigma_(0)(v)=(e^(v))/(1+e^(v))\sigma_{0}(v)=\frac{e^{v}}{1+e^{v}}σ0(v)=ev1+ev. Of course, P [ F t = 10 s t = s ] = μ ( s , 10 , θ μ ) = 1 μ ( s , 10 , θ μ ) P F t = 10 s t = s = μ s , 10 , θ μ = 1 μ s , 10 , θ μ P[F_(t)=-10∣s_(t)=s]=mu(s,-10,theta^(mu))=1-mu(s,10,theta^(mu))\mathbb{P}\left[F_{t}=-10 \mid s_{t}=s\right]=\mu\left(s,-10, \theta^{\mu}\right)=1-\mu\left(s, 10, \theta^{\mu}\right)P[Ft=10st=s]=μ(s,10,θμ)=1μ(s,10,θμ).
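The two-action policy above reduces to a logistic probability. A minimal sketch, where `logit` stands in for the network output $W^{2} h\left(W^{1} s+b^{1}\right)+b^{2}$ (a hypothetical placeholder, not the trained network):

```python
import math

def sigmoid(v):
    """sigma_0(v) = e^v / (1 + e^v), written in the numerically common form."""
    return 1.0 / (1.0 + math.exp(-v))

def action_probabilities(logit):
    """Probabilities of the two forces F_t = +10 and F_t = -10."""
    p_plus = sigmoid(logit)            # P[F_t = +10 | s_t = s]
    return {+10: p_plus, -10: 1.0 - p_plus}
```

A zero logit gives equal probability to both actions, and the two probabilities always sum to one.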
A total of 525 cases are run, each for 25 hours. The optimal policy is learned using the estimated dynamics $f(s, \theta)$ and is updated every 5 episodes. Table 4 reports the results at fixed episodes using continuous-time stochastic gradient descent. Table 5 reports statistics on the number of episodes required until a target episodic reward (100, 500, or 1000) is first achieved.
| Reward/Episode | 10 | 20 | 30 | 40 | 45 |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Maximum reward | -20 | 981 | $2.21 \times 10^{4}$ | $6.64 \times 10^{5}$ | $9.22 \times 10^{5}$ |
| 90% quantile of reward | -63 | 184 | 760 | 8354 | $1.5 \times 10^{4}$ |
| Mean reward | -78 | 67 | 401 | 5659 | $1.22 \times 10^{4}$ |
| 10% quantile of reward | -89 | -34 | 36 | 69 | 93 |
| Minimum reward | -92 | -82 | -61 | -46 | -23 |
Table 4: Reward at the k k kkk-th episode across the 525 cases using continuous-time stochastic gradient descent to learn the model dynamics.
| Number of episodes/Target reward | 100 | 500 | 1000 |
| :---: | :---: | :---: | :---: |
| Maximum | 39 | 134 | 428 |
| 90% quantile | 23 | 49 | 61 |
| Mean | 18 | 34 | 43 |
| 10% quantile | 13 | 21 | 26 |
| Minimum | 11 | 14 | 17 |
Table 5: For each case, we record the number of episodes required until the target reward is first achieved using continuous-time stochastic gradient descent. The table reports statistics (maximum, quantiles, mean, minimum) of this number of episodes across the 525 cases.
Alternatively, one could apply policy gradient directly to learn the optimal action from the observed data, without using continuous-time stochastic gradient descent to learn the model dynamics. Again using 525 cases, Table 6 reports the results for directly learning the optimal policy in this way. Comparing Tables 4 and 6, it is clear that using continuous-time stochastic gradient descent to learn the model dynamics allows the optimal policy to be learned significantly more quickly: the rewards are much higher when using it (Table 4) than when not (Table 6).
| Reward/Episode | 10 | 20 | 30 | 40 | 100 | 500 | 750 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Maximum reward | 51 | 1 | 15 | 77 | 121 | 1748 | $1.91 \times 10^{5}$ |
| 90% quantile of reward | -52 | -48 | -42 | 8354 | -11 | 345 | 2314 |
| Mean reward | -73 | -72 | -69 | -68 | -53 | 150 | 1476 |
| 10% quantile of reward | -88 | -88 | -87 | 69 | -83 | -1 | 63 |
| Minimum reward | -92 | -92 | -92 | -92 | -92 | -81 | -74 |
Table 6: Reward at the k k kkk-th episode across the 525 cases using policy gradient to learn the optimal policy.

6.5. Estimating both the drift and volatility functions for the multidimensional CIR process

We now implement an example where SGDCT is used to estimate both the drift function and the volatility function. The multidimensional CIR process X t R d X t R d X_(t)inR^(d)X_{t} \in \mathbb{R}^{d}XtRd is:
d X t = c ( m X t ) d t + X t σ d W t , d X t = c m X t d t + X t σ d W t , dX_(t)=c(m-X_(t))dt+sqrt(X_(t))o.sigma dW_(t),d X_{t}=c\left(m-X_{t}\right) d t+\sqrt{X_{t}} \odot \sigma d W_{t},dXt=c(mXt)dt+XtσdWt,
where o.\odot is element-wise multiplication, m R d , c , σ R d × d , W t R d m R d , c , σ R d × d , W t R d m inR^(d),c,sigma inR^(d xx d),W_(t)inR^(d)m \in \mathbb{R}^{d}, c, \sigma \in \mathbb{R}^{d \times d}, W_{t} \in \mathbb{R}^{d}mRd,c,σRd×d,WtRd, with c c ccc being a positive definite matrix. The CIR process is often used for modeling interest rates.
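A hedged sketch of an Euler scheme for this multidimensional CIR process, restricted to a diagonal $\sigma$ for brevity (an assumption; the paper allows full matrices $c, \sigma \in \mathbb{R}^{d \times d}$):

```python
import random

def cir_euler_step(x, c_diag, m, sigma_diag, dt, rng):
    """One Euler step of dX = c(m - X) dt + sqrt(X) ⊙ sigma dW (diagonal case)."""
    out = []
    for xi, ci, mi, si in zip(x, c_diag, m, sigma_diag):
        dw = rng.gauss(0.0, dt ** 0.5)
        xi_new = xi + ci * (mi - xi) * dt + (max(xi, 0.0) ** 0.5) * si * dw
        # Truncate at 0 so the square root stays well-defined; one of several
        # common fixes for discretized CIR (an implementation choice here).
        out.append(max(xi_new, 0.0))
    return out
```

Running the scheme for a while keeps the path nonnegative and mean-reverting around $m$, which is the qualitative behavior one expects from CIR.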
In equation (5.8), $f(x, \theta)=c(m-x)$ where $\theta=(c, m)$, and $f^{*}(x)=f\left(x, \theta^{*}\right)$ where $\theta^{*}=\left(c^{*}, m^{*}\right)$. The volatility model is $\sigma(x, \nu)=\sqrt{x} \odot \nu$ and $\sigma^{*}(x)=\sigma\left(x, \nu^{*}\right)$ where $\nu, \nu^{*} \in \mathbb{R}^{d \times d}$. Table 7 reports the accuracy of SGDCT for estimating the drift and volatility functions of the CIR process.
| Error/Parameter | $c$ | $m$ | $\left(\sqrt{X_{t}} \odot \sigma\right)^{\top}\left(\sqrt{X_{t}} \odot \sigma\right)$ |
| :---: | :---: | :---: | :---: |
| Maximum error | 0.0157 | 0.009 | 0.010 |
| 99% quantile of error | 0.010 | 0.007 | 0.008 |
| 99.9% quantile of error | 0.0146 | 0.009 | 0.010 |
| Mean squared error | $1.49 \times 10^{-5}$ | $6.65 \times 10^{-6}$ | $4.21 \times 10^{-6}$ |
| Mean error in percent | 0.21 | 0.137 | 0.0623 |
| Maximum error in percent | 1.12 | 0.695 | 0.456 |
| 99% quantile of error in percent | 0.782 | 0.506 | 0.415 |
| 99.9% quantile of error in percent | 1.06 | 0.616 | 0.455 |
Table 7: Accuracy is reported in percent and averaged across 317 simulations. Each simulation has a different random initialization for $c$, $m$, and $\sigma$. The dimension is $d=3$, the time step size is $10^{-2}$, and accuracy is evaluated at the final time $5 \times 10^{5}$. $X_{t}$ is simulated using the scheme of [4]. Observations of the quadratic variation are generated from $\left(\sqrt{X_{t}} \odot \sigma\right)^{\top}\left(\sqrt{X_{t}} \odot \sigma\right)$ at times $t=0, .01, .02, \ldots$; this quantity is the quadratic variation per unit of time. For each simulation, the average error (or average percent error) for the quadratic variation per unit time is calculated by averaging across many points in the path of $X_{t}$. Then, the statistics in the third column of the table are calculated using the average errors (or average percent errors) from the 317 simulations.

7. American Options

High-dimensional American options are extremely computationally challenging to solve with traditional numerical methods such as finite difference methods. Here we propose a new approach that uses statistical learning to solve high-dimensional American options. SGDCT achieves high accuracy on two benchmark problems in 100 dimensions.

7.1. Q-learning

Before describing the SGDCT algorithm for American options, it is important to note that traditional stochastic gradient descent faces certain difficulties in this class of problems. Some brief remarks are provided below regarding this fact; the authors plan to elaborate on these issues in more detail in a future work. The well-known Q-learning algorithm uses stochastic gradient descent to minimize an approximation to the discrete-time Hamilton-Jacobi-Bellman equation. To demonstrate the challenges and the issues that arise, consider using Q-learning to estimate the value function:
V ( x ) = E [ 0 e γ t r ( X t ) d t X 0 = x ] , X t = x + W t , V ( x ) = E 0 e γ t r X t d t X 0 = x , X t = x + W t , V(x)=E[int_(0)^(oo)e^(-gamma t)r(X_(t))dt∣X_(0)=x],quadX_(t)=x+W_(t),V(x)=\mathbb{E}\left[\int_{0}^{\infty} e^{-\gamma t} r\left(X_{t}\right) d t \mid X_{0}=x\right], \quad X_{t}=x+W_{t},V(x)=E[0eγtr(Xt)dtX0=x],Xt=x+Wt,
where γ > 0 γ > 0 gamma > 0\gamma>0γ>0 is a discount factor and r ( x ) r ( x ) r(x)r(x)r(x) is a reward function. The function Q ( x , θ ) Q ( x , θ ) Q(x,theta)Q(x, \theta)Q(x,θ) is an approximation for the value function V ( x ) V ( x ) V(x)V(x)V(x). The parameter θ θ theta\thetaθ must be estimated. The traditional approach would discretize the dynamics (6.1) and then apply a stochastic gradient descent update to the objective function:
$$\mathbb{E}\left[\left(r\left(X_{t}\right) \Delta+\mathbb{E}\left[e^{-\gamma \Delta} Q\left(X_{t+\Delta} ; \theta\right) \mid X_{t}\right]-Q\left(X_{t} ; \theta\right)\right)^{2}\right].$$
This results in the stochastic gradient descent algorithm:
θ t + Δ = θ t α t Δ ( e γ Δ E [ Q θ ( X t + Δ ; θ t ) X t ] Q θ ( X t ; θ t ) ) × ( r ( X t ) Δ + e γ Δ E [ Q ( X t + Δ ; θ t ) X t ] Q ( X t ; θ t ) ) . θ t + Δ = θ t α t Δ e γ Δ E Q θ X t + Δ ; θ t X t Q θ X t ; θ t × r X t Δ + e γ Δ E Q X t + Δ ; θ t X t Q X t ; θ t . {:[theta_(t+Delta)=theta_(t)-(alpha_(t))/(Delta)(e^(-gamma Delta)E[Q_(theta)(X_(t+Delta);theta_(t))∣X_(t)]-Q_(theta)(X_(t);theta_(t)))],[ xx(r(X_(t))Delta+e^(-gamma Delta)E[Q(X_(t+Delta);theta_(t))∣X_(t)]-Q(X_(t);theta_(t))).]:}\begin{aligned} \theta_{t+\Delta} & =\theta_{t}-\frac{\alpha_{t}}{\Delta}\left(e^{-\gamma \Delta} \mathbb{E}\left[Q_{\theta}\left(X_{t+\Delta} ; \theta_{t}\right) \mid X_{t}\right]-Q_{\theta}\left(X_{t} ; \theta_{t}\right)\right) \\ & \times\left(r\left(X_{t}\right) \Delta+e^{-\gamma \Delta} \mathbb{E}\left[Q\left(X_{t+\Delta} ; \theta_{t}\right) \mid X_{t}\right]-Q\left(X_{t} ; \theta_{t}\right)\right) . \end{aligned}θt+Δ=θtαtΔ(eγΔE[Qθ(Xt+Δ;θt)Xt]Qθ(Xt;θt))×(r(Xt)Δ+eγΔE[Q(Xt+Δ;θt)Xt]Q(Xt;θt)).
Note that we have scaled the learning rate in (6.3) by $\frac{1}{\Delta}$. This is the correct scaling for taking the limit $\Delta \rightarrow 0$. The algorithm (6.3) has a major computational issue: if the process $X_{t}$ is high-dimensional, $\mathbb{E}\left[Q\left(X_{t+\Delta} ; \theta_{t}\right) \mid X_{t}\right]$ is computationally challenging to calculate, and this calculation must be repeated for a large number of samples (millions to hundreds of millions). It is also important to note that for the American option example that follows, the underlying dynamics are known. However, in reinforcement learning applications the transition probability is unknown, in which case $\mathbb{E}\left[Q\left(X_{t+\Delta} ; \theta_{t}\right) \mid X_{t}\right]$ cannot be calculated at all. To circumvent these obstacles, the Q-learning algorithm ignores the inner expectation in (6.2), leading to the algorithm:
θ t + Δ = θ t α t Δ ( e γ Δ Q θ ( X t + Δ ; θ t ) Q θ ( X t ; θ t ) ) ( r ( X t ) Δ + e γ Δ Q ( X t + Δ ; θ t ) Q ( X t ; θ t ) ) θ t + Δ = θ t α t Δ e γ Δ Q θ X t + Δ ; θ t Q θ X t ; θ t r X t Δ + e γ Δ Q X t + Δ ; θ t Q X t ; θ t theta_(t+Delta)=theta_(t)-(alpha_(t))/(Delta)(e^(-gamma Delta)Q_(theta)(X_(t+Delta);theta_(t))-Q_(theta)(X_(t);theta_(t)))(r(X_(t))Delta+e^(-gamma Delta)Q(X_(t+Delta);theta_(t))-Q(X_(t);theta_(t)))\theta_{t+\Delta}=\theta_{t}-\frac{\alpha_{t}}{\Delta}\left(e^{-\gamma \Delta} Q_{\theta}\left(X_{t+\Delta} ; \theta_{t}\right)-Q_{\theta}\left(X_{t} ; \theta_{t}\right)\right)\left(r\left(X_{t}\right) \Delta+e^{-\gamma \Delta} Q\left(X_{t+\Delta} ; \theta_{t}\right)-Q\left(X_{t} ; \theta_{t}\right)\right)θt+Δ=θtαtΔ(eγΔQθ(Xt+Δ;θt)Qθ(Xt;θt))(r(Xt)Δ+eγΔQ(Xt+Δ;θt)Q(Xt;θt))
Although computationally efficient, the Q-learning algorithm (6.4) is biased (due to ignoring the inner expectations). Furthermore, as $\Delta \rightarrow 0$, the Q-learning algorithm (6.4) blows up: a quick investigation shows that the term $\frac{1}{\Delta}\left(W_{t+\Delta}-W_{t}\right)^{2}=O(1)$ arises while all other terms are $O(\Delta)$ or $O(\sqrt{\Delta})$.
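The scaling claim is easy to check numerically: the mean of $\frac{1}{\Delta}\left(W_{t+\Delta}-W_{t}\right)^{2}$ is exactly $1$ regardless of how small $\Delta$ is, so this term does not vanish in the limit. A hedged Monte Carlo sketch:

```python
import random

def mean_scaled_square(dt, n=200_000, seed=0):
    """Monte Carlo estimate of E[(W_{t+dt} - W_t)^2 / dt], which equals 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        dw = rng.gauss(0.0, dt ** 0.5)   # Brownian increment ~ N(0, dt)
        total += dw * dw / dt
    return total / n
```

The estimate stays near 1 for both moderate and tiny time steps, confirming the $O(1)$ behavior, while a term like $\sqrt{\Delta}$ would shrink by a factor of 10 between the two step sizes.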
The SGDCT algorithm is unbiased and computationally efficient. It can be derived directly by letting $\Delta \rightarrow 0$ and using Itô's formula in (6.3):
d θ t = α t ( 1 2 Q θ x x ( X t ; θ t ) γ Q θ ( X t ; θ t ) ) ( r ( X t ) + 1 2 Q x x ( X t ; θ t ) γ Q ( X t ; θ t ) ) d t . d θ t = α t 1 2 Q θ x x X t ; θ t γ Q θ X t ; θ t r X t + 1 2 Q x x X t ; θ t γ Q X t ; θ t d t . dtheta_(t)=-alpha_(t)((1)/(2)Q_(theta xx)(X_(t);theta_(t))-gammaQ_(theta)(X_(t);theta_(t)))(r(X_(t))+(1)/(2)Q_(xx)(X_(t);theta_(t))-gamma Q(X_(t);theta_(t)))dt.d \theta_{t}=-\alpha_{t}\left(\frac{1}{2} Q_{\theta x x}\left(X_{t} ; \theta_{t}\right)-\gamma Q_{\theta}\left(X_{t} ; \theta_{t}\right)\right)\left(r\left(X_{t}\right)+\frac{1}{2} Q_{x x}\left(X_{t} ; \theta_{t}\right)-\gamma Q\left(X_{t} ; \theta_{t}\right)\right) d t .dθt=αt(12Qθxx(Xt;θt)γQθ(Xt;θt))(r(Xt)+12Qxx(Xt;θt)γQ(Xt;θt))dt.
Note that the computationally challenging terms in (6.3) become differential operators in (6.5), which are usually easier to evaluate. This is one of the advantages of developing the theory in continuous time for continuous-time models. Once the continuous-time algorithm is derived, it can be appropriately discretized for numerical solution.

7.2. SGDCT for American Options

Let X t R d X t R d X_(t)inR^(d)X_{t} \in \mathbb{R}^{d}XtRd be the prices of d d ddd stocks. The maturity date is time T T TTT and the payoff function is g ( x ) : R d R g ( x ) : R d R g(x):R^(d)rarrRg(x): \mathbb{R}^{d} \rightarrow \mathbb{R}g(x):RdR. The stock dynamics and value function are:
d X t i = μ ( X t i ) d t + σ ( X t i ) d W t i , V ( t , x ) = sup τ t E [ e r ( τ T ) g ( X τ T ) X t = x ] , d X t i = μ X t i d t + σ X t i d W t i , V ( t , x ) = sup τ t E e r ( τ T ) g X τ T X t = x , {:[dX_(t)^(i)=mu(X_(t)^(i))dt+sigma(X_(t)^(i))dW_(t)^(i)","],[V(t","x)=s u p_(tau >= t)E[e^(-r(tau^^T))g(X_(tau^^T))∣X_(t)=x]","]:}\begin{aligned} d X_{t}^{i} & =\mu\left(X_{t}^{i}\right) d t+\sigma\left(X_{t}^{i}\right) d W_{t}^{i}, \\ V(t, x) & =\sup _{\tau \geq t} \mathbb{E}\left[e^{-r(\tau \wedge T)} g\left(X_{\tau \wedge T}\right) \mid X_{t}=x\right], \end{aligned}dXti=μ(Xti)dt+σ(Xti)dWti,V(t,x)=supτtE[er(τT)g(XτT)Xt=x],
where W t R d W t R d W_(t)inR^(d)W_{t} \in \mathbb{R}^{d}WtRd is a Brownian motion. The distribution of W t W t W_(t)W_{t}Wt is specified by Var [ W t i ] = t Var W t i = t Var[W_(t)^(i)]=t\operatorname{Var}\left[W_{t}^{i}\right]=tVar[Wti]=t and Corr [ W t i , W t j ] = Corr W t i , W t j = Corr[W_(t)^(i),W_(t)^(j)]=\operatorname{Corr}\left[W_{t}^{i}, W_{t}^{j}\right]=Corr[Wti,Wtj]= ρ i , j ρ i , j rho_(i,j)\rho_{i, j}ρi,j for i j i j i!=ji \neq jij. The SGDCT algorithm for an American option is:
$$\begin{aligned}
\theta_{\tau \wedge T}^{n+1} &= \theta_{0}^{n}-\int_{0}^{\tau \wedge T} \alpha_{t}^{n+1}\left(\frac{\partial}{\partial t} Q_{\theta}\left(t, X_{t} ; \theta_{t}^{n+1}\right)+\mathcal{L}_{x} Q_{\theta}\left(t, X_{t} ; \theta_{t}^{n+1}\right)-r Q_{\theta}\left(t, X_{t} ; \theta_{t}^{n+1}\right)\right) \\
&\quad \times\left(\frac{\partial Q}{\partial t}\left(t, X_{t} ; \theta_{t}^{n+1}\right)+\mathcal{L}_{x} Q\left(t, X_{t} ; \theta_{t}^{n+1}\right)-r Q\left(t, X_{t} ; \theta_{t}^{n+1}\right)\right) d t \\
&\quad +\alpha_{\tau \wedge T}^{n+1} Q_{\theta}\left(\tau \wedge T, X_{\tau \wedge T} ; \theta_{\tau \wedge T}^{n+1}\right)\left(g\left(X_{\tau \wedge T}\right)-Q\left(\tau \wedge T, X_{\tau \wedge T} ; \theta_{\tau \wedge T}^{n+1}\right)\right), \\
\tau &= \inf \left\{t \geq 0: Q\left(t, X_{t} ; \theta_{t}^{n+1}\right)<g\left(X_{t}\right)\right\}, \\
X_{0} &\sim \nu(d x).
\end{aligned}$$
Here $\mathcal{L}_{x}$ is the infinitesimal generator of the $X$ process. The continuous-time algorithm (6.7) is run for many iterations $n=0,1,2, \ldots$ until convergence. See the authors' paper [28] for implementation details on pricing American options with deep learning.
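Simulating the stock dynamics requires Brownian increments with $\operatorname{Var}\left[W_{t}^{i}\right]=t$ and $\operatorname{Corr}\left[W_{t}^{i}, W_{t}^{j}\right]=\rho$. A hedged sketch for the equicorrelated case used in the benchmarks, via the one-factor decomposition $dW^{i}=\sqrt{\rho}\, Z_{0}+\sqrt{1-\rho}\, Z_{i}$ (valid for $\rho \geq 0$; this decomposition is an implementation choice, not the paper's stated method):

```python
import math
import random

def correlated_increments(d, rho, dt, rng):
    """d Brownian increments with Var = dt and pairwise correlation rho >= 0."""
    z0 = rng.gauss(0.0, 1.0)   # common factor shared by all components
    return [math.sqrt(dt) * (math.sqrt(rho) * z0
                             + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0))
            for _ in range(d)]
```

An empirical check over many samples recovers the target variance and correlation (e.g. $\rho = 0.75$ as in Table 8).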
We implement the SGDCT algorithm (6.7) using a deep neural network for the function $Q(t, x ; \theta)$. Two benchmark problems are considered where semi-analytic solutions are available. The SGDCT algorithm's accuracy is evaluated for American options in $d=100$ dimensions, and the results are presented in Table 8.
| Model | Number of dimensions | Payoff function | Accuracy |
| :---: | :---: | :---: | :---: |
| Bachelier | 100 | $g(x)=\max \left(\frac{1}{d} \sum_{i=1}^{d} x_{i}-K, 0\right)$ | $0.1 \%$ |
| Black-Scholes | 100 | $g(x)=\max \left(\left(\prod_{i=1}^{d} x_{i}\right)^{1 / d}-K, 0\right)$ | $0.2 \%$ |
Table 8: For the Bachelier model, μ ( x ) = r c μ ( x ) = r c mu(x)=r-c\mu(x)=r-cμ(x)=rc and σ ( x ) = σ σ ( x ) = σ sigma(x)=sigma\sigma(x)=\sigmaσ(x)=σ. For Black-Scholes, μ ( x ) = ( r c ) x μ ( x ) = ( r c ) x mu(x)=(r-c)x\mu(x)=(r-c) xμ(x)=(rc)x and σ ( x ) = σ x σ ( x ) = σ x sigma(x)=sigma x\sigma(x)=\sigma xσ(x)=σx. All stocks are identical with correlation ρ i , j = .75 ρ i , j = .75 rho_(i,j)=.75\rho_{i, j}=.75ρi,j=.75, volatility σ = .25 σ = .25 sigma=.25\sigma=.25σ=.25, initial stock price X 0 = 1 X 0 = 1 X_(0)=1X_{0}=1X0=1, dividend rate c = 0.02 c = 0.02 c=0.02c=0.02c=0.02, and interest rate r = 0 r = 0 r=0r=0r=0. The maturity of the option is T = 2 T = 2 T=2T=2T=2 and the strike price is K = 1 K = 1 K=1K=1K=1. The accuracy is reported for the price of the at-the-money American call option.
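The two benchmark payoffs from Table 8 are straightforward to evaluate; a minimal sketch, computing the geometric mean through logarithms to avoid overflow of the 100-fold product:

```python
import math

def bachelier_payoff(x, K):
    """Arithmetic-basket call payoff: max(mean(x) - K, 0)."""
    return max(sum(x) / len(x) - K, 0.0)

def black_scholes_payoff(x, K):
    """Geometric-basket call payoff: max(geometric_mean(x) - K, 0)."""
    g = math.exp(sum(math.log(v) for v in x) / len(x))
    return max(g - K, 0.0)
```

For example, with all 100 stocks at $1.2$ and strike $K=1$, the Bachelier payoff is $0.2$; with all stocks at the strike, the geometric payoff is exactly zero.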
We recall the following regularity result from [24] on Poisson equations in the whole space, stated appropriately to cover our case of interest.
Theorem A.1. Let Conditions 2.2 and 2.3 be satisfied. Assume that G ( x , θ ) C α , 2 ( X , R n ) G ( x , θ ) C α , 2 X , R n G(x,theta)inC^(alpha,2)(X,R^(n))G(x, \theta) \in C^{\alpha, 2}\left(\mathcal{X}, \mathbb{R}^{n}\right)G(x,θ)Cα,2(X,Rn),
X G ( x , θ ) π ( d x ) = 0 X G ( x , θ ) π ( d x ) = 0 int_(X)G(x,theta)pi(dx)=0\int_{\mathcal{X}} G(x, \theta) \pi(d x)=0XG(x,θ)π(dx)=0
and that for some positive constants K K KKK and q q qqq,
i = 0 2 | i G θ i ( x , θ ) | K ( 1 + | x | q ) i = 0 2 i G θ i ( x , θ ) K 1 + | x | q sum_(i=0)^(2)|(del^(i)G)/(deltheta^(i))(x,theta)| <= K(1+|x|^(q))\sum_{i=0}^{2}\left|\frac{\partial^{i} G}{\partial \theta^{i}}(x, \theta)\right| \leq K\left(1+|x|^{q}\right)i=02|iGθi(x,θ)|K(1+|x|q)
Let L x L x L_(x)\mathcal{L}_{x}Lx be the infinitesimal generator for the X X XXX process. Then the Poisson equation
L x u ( x , θ ) = G ( x , θ ) , X u ( x , θ ) π ( d x ) = 0 L x u ( x , θ ) = G ( x , θ ) , X u ( x , θ ) π ( d x ) = 0 L_(x)u(x,theta)=G(x,theta),quadint_(X)u(x,theta)pi(dx)=0\mathcal{L}_{x} u(x, \theta)=G(x, \theta), \quad \int_{\mathcal{X}} u(x, \theta) \pi(d x)=0Lxu(x,θ)=G(x,θ),Xu(x,θ)π(dx)=0
has a unique solution that satisfies $u(x, \cdot) \in C^{2}$ for every $x \in \mathcal{X}$, $\partial_{\theta}^{2} u \in C\left(\mathcal{X} \times \mathbb{R}^{n}\right)$, and there exist positive constants $K^{\prime}$ and $q^{\prime}$ such that
i = 0 2 | i u θ i ( x , θ ) | + | 2 u x θ ( x , θ ) | K ( 1 + | x | q ) . i = 0 2 i u θ i ( x , θ ) + 2 u x θ ( x , θ ) K 1 + | x | q . sum_(i=0)^(2)|(del^(i)u)/(deltheta^(i))(x,theta)|+|(del^(2)u)/(del x del theta)(x,theta)| <= K^(')(1+|x|^(q^('))).\sum_{i=0}^{2}\left|\frac{\partial^{i} u}{\partial \theta^{i}}(x, \theta)\right|+\left|\frac{\partial^{2} u}{\partial x \partial \theta}(x, \theta)\right| \leq K^{\prime}\left(1+|x|^{q^{\prime}}\right) .i=02|iuθi(x,θ)|+|2uxθ(x,θ)|K(1+|x|q).

9. References

[1] Y. Ait-Sahalia, Maximum Likelihood Estimation of Discretely Sampled Diffusions: A Closed-form Approximation Approach, Econometrica, Vol. 70, No. 1, 2002, pp. 223-262.
[2] Y. Ait-Sahalia, Closed-form likelihood expansions for multivariate diffusions, Annals of Statistics, Vol. 36, No. 2, 2008, pp. 906-937.
[3] A. Alabert and I. Gyongy, On numerical approximation of stochastic Burger's equation, From stochastic calculus to mathematical finance, Springer Berlin Heidelberg, 2006, pp. 1-15.
[4] A. Alfonsi, High order discretization schemes for the CIR process: application to affine term structure and Heston models, Mathematics of Computation, Vol. 79, No. 269, 2010, pp. 209-237.
[5] R. Ahlip and M. Rutkowski, Pricing of foreign exchange options under the Heston stochastic volatility model and CIR interest rates, Quantitative Finance, Vol. 13, No. 6, 2013, pp. 955-966.
[6] A. Barto, R. Sutton, and C. Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-13, 1983, pp. 834-846.
[7] I. Basawa and B. Rao, Asymptotic inference for stochastic processes, Stochastic Processes and their Applications, Vol.10, No. 3, 1980, pp. 221-254.
[8] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, 2012.
[9] D. P. Bertsekas and J. N. Tsitsiklis, Gradient convergence in gradient methods with errors, SIAM Journal on Optimization, Vol. 10, No. 3, 2000, pp. 627-642.
[10] J. P. N. Bishwal, Parameter estimation in stochastic differential equations, in: Lecture Notes in Mathematics, Vol. 1923, Springer Science & Business Media, 2008.
[11] S. Brown and P. Dybvig, The empirical implications of the Cox, Ingersoll, Ross theory of the term structure of interest rates, The Journal of Finance, Vol. 41, No. 3, 1986, pp. 617-630.
[12] Q. Dai and K. Singleton, Specification analysis of affine term structure models, The Journal of Finance, Vol. 55, No. 5, 2000, pp. 1943-1978.
[13] A. Davie and J. Gaines, Convergence of numerical schemes for the solution of parabolic stochastic partial differential equations, Mathematics of Computation, Vol. 70, No. 233, (2001), pp. 121-134.
[14] O. Elerian, S. Chib, and N. Shephard, Likelihood inference for discretely observed nonlinear diffusions, Econometrica, Vol. 69, No. 4, (2001), pp. 959-993.
[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[16] H. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, Second Edition. Springer, 2003.
[17] Y. Kutoyants, Statistical inference for ergodic diffusion processes, Springer Science & Business Media, 2004.
[18] V. Linetsky, Computing hitting time densities for CIR and OU diffusions: applications to mean-reverting models, Journal of Computational Finance, Vol. 7, 2004, pp. 1-22.
[19] T. Leung and X. Li, Optimal mean reversion trading with transaction costs and stop-loss exit, International Journal of Theoretical and Applied Finance, Vol. 18, No. 3, 2015, p. 155020.
[20] T. Leung, J. Li, X. Li, and Z. Wang, Speculative futures trading under mean reversion, Asia-Pacific Financial Markets, Vol. 23, No. 4, 2015, pp. 281-304.
[21] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, Vol. 19, No. 4, 2009, pp. 1574-1609.
[22] Y. Maghsoodi, Solution of the extended CIR term structure and bond option valuation, Mathematical Finance, Vol. 6, No. 1, 1996, pp. 89-109.
[23] E. Pardoux and A. Yu. Veretennikov, On Poisson equation and diffusion approximation 1, Annals of Probability, Vol. 29, No. 3, 2001, pp. 1061-1085.
[24] E. Pardoux and A. Yu. Veretennikov, On Poisson equation and diffusion approximation 2, Annals of Probability, Vol. 31, No. 3, 2003, pp. 1166-1192.
[25] M. Raginsky and J. Bouvrie, Continuous-time stochastic mirror descent on a network: variance reduction, consensus, convergence, IEEE Conference on Decision and Control, 2012.
[26] B. L. S. P. Rao, Statistical inference for diffusion type processes, Arnold, 1999.
[27] G. O. Roberts and R.L. Tweedie, Exponential convergence of Langevin distributions and their discrete approximations, Bernoulli, 1996, pp. 341-363.
[28] J. Sirignano and K. Spiliopoulos, DGM: A deep learning algorithm for solving partial differential equations, 2017, arXiv: 1708.07469.
[29] O. Vasicek, An equilibrium characterization of the term structure, Journal of Financial Economics, Vol. 5, No. 2, 1977, pp. 177-188.

  1. *Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, Urbana, Email: jasirign@illinois.edu
    ^(†){ }^{\dagger} Department of Mathematics and Statistics, Boston University, Boston, E-mail: kspiliop@math.bu.edu
    ^(‡){ }^{\ddagger} Research of K.S. supported in part by the National Science Foundation (DMS 1550918)
    § § §\S§ Computations for this paper were supported by a Blue Waters supercomputer grant.
  2. ${ }^{1}$ Let $r_{e, t}$ be the reward for episode $e$ at time $t$, and let $T_e$ denote the final time of episode $e$. Let $R_{t, e}=\sum_{t^{\prime}=t+1}^{T_e} \gamma^{t^{\prime}-t} r_{e, t^{\prime}}$ be the cumulative discounted reward from episode $e$ after time $t$, where $\gamma \in[0,1]$ is the discount factor. Stochastic gradient descent is used to learn the parameter $\theta^{\mu}$: $\theta^{\mu} \leftarrow \theta^{\mu}+\eta_{e} R_{t, e} \frac{\partial}{\partial \theta^{\mu}} \log \mu\left(s_{t}, a_{t}, \theta^{\mu}\right)$, where $\eta_{e}$ is the learning rate. In practice, the cumulative discounted rewards are often normalized across an episode.
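     The discounted return and the policy-gradient update in this footnote can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear-softmax form of the policy $\mu$, the discrete action space, and all function and variable names are assumptions made for the example.

     ```python
     import numpy as np

     def discounted_returns(rewards, gamma):
         """R_t = sum_{t'=t+1}^{T} gamma^(t'-t) * r_{t'}, computed with the
         backward recursion R_t = gamma * (r_{t+1} + R_{t+1}), R_{T} = 0."""
         T = len(rewards)
         R = np.zeros(T)
         for t in range(T - 2, -1, -1):
             R[t] = gamma * (rewards[t + 1] + R[t + 1])
         return R

     def reinforce_update(theta, states, actions, rewards, gamma, eta):
         """One episode of theta <- theta + eta * R_{t,e} * grad log mu(s_t, a_t; theta),
         for a hypothetical linear-softmax policy mu over discrete actions
         (theta has shape (n_actions, state_dim))."""
         R = discounted_returns(rewards, gamma)
         R = (R - R.mean()) / (R.std() + 1e-8)  # normalization across the episode
         for s, a, Rt in zip(states, actions, R):
             logits = theta @ s                  # shape (n_actions,)
             p = np.exp(logits - logits.max())   # numerically stable softmax
             p /= p.sum()
             grad_log_mu = -np.outer(p, s)       # score function of the softmax
             grad_log_mu[a] += s                 # plus the term for the chosen action
             theta = theta + eta * Rt * grad_log_mu
         return theta
     ```

     Note that $R_{t,e}$ excludes the reward at time $t$ itself (the sum starts at $t'=t+1$), which is why the recursion assigns `R[t] = gamma * (rewards[t+1] + R[t+1])` rather than the more common convention that starts the sum with $r_{t+1}$ at coefficient $\gamma^{0}$.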